This dissertation proposes a different approach to represent broadband
audio signals. Conventional audio coding algorithms such as the MPEG
perform a filter bank analysis and apply a psychoacoustically designed
dynamic quantization on each subband signal. On the other hand, the
proposed  algorithm  performs  a  subband  analysis  on  input  audio  using  wavelet 
transforms,  and  the  resulting  subband  signals  are  represented  by  the  optimal 
codewords  from  the  predetermined  codebook.  By  carefully  selecting  the  base 
wavelet  and  the  analysis  tree  structure  the  subband  analysis  is  achieved 
according  to  the  critical  band  concept.  Overall,  the  system  is  based  on  the 
proven  idea  that  the  time  varying  specific  loudness  along  the  critical  bands 
contains sufficient information to reproduce the transparent audio output.

The sample audio files of various monotones and their combinations are
processed through the proposed system. The outputs are virtually transparent
to the human ear, even though actual music signals need some further
refinement. The overall performance of the codebook based audio
coder shows great potential. The current system is not optimized for a high
compression  ratio,  but  with  a  dynamic  quantizer  in  place  the  compression  is 
expected  to  be  high.  Furthermore,  the  output  coefficients  allow  an  easy 
implementation  of  the  psychoacoustical  ideas  including  masking  effects  since 
they  are  directly  related  to  the  frequency  components  of  a  given  input  audio 
frame. 

We  have  knowingly  undertaken  a  mammoth  task  since  many  tens  of 
man-years  were  put  into  developing  and  refining  toll-quality  speech  coders, 
and  thus  we  know  much  work  will  need  to  be  done  in  optimizing  the  codebook 
approach  for  high  fidelity  audio.  Nonetheless,  the  method  presented  here 
represents  a  first  step  in  this  direction  and  provides  a  demonstration  of 
mathematical  and  practical  feasibility  of  the  codebook  approach  as  well  as  a 
platform for further refinements.

CHAPTER I
INTRODUCTION

1.1 Fourier Representation of Signals

The representation of signals by sums of complex exponentials or
sinusoids leads to convenient solutions to problems in many areas of science
and  engineering.  It  often  provides  greater  insight  into  physical  phenomena 
than  is  available  by  other  means.  Such  Fourier  representations  are  commonly 
used  to  determine  the  response  of  linear  systems  to  a  superposition  of  complex 
exponentials  or  sinusoids.  As  an  example,  speech  communication  research  is 
one  area  where  the  concept  of  a  Fourier  representation  has  traditionally 
played a major role. The production of a steady state speech sound such as a
vowel or fricative can be modeled as a linear system excited by a source which
varies either periodically or randomly with time. If such a model meets
certain general conditions [Rab78], the spectrum of the output is the
product of the frequency response of the vocal tract system and the spectrum
of the excitation. Based on the same principle, when the input signal is
represented by a sum of complex exponentials, the spectrum of the output of
such a model is the sum of the outputs for each complex
exponential input.

A major drawback of the Fourier representation is that it does not
admit  an  appropriate  representation  of  the  transient  characteristics  of 
signals.  This  can  be  seen  from  the  definition  of  the  discrete-time  Fourier 
transform  given  by 

$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n}. \tag{1.1}$$

The time interval $(-\infty, \infty)$ is required to characterize the signal
successfully. Also, if there is a local variation such as an impulse, the
impact of such a rapid change spreads over the entire frequency spectrum.
Processing signals with transient variations therefore requires a
representation that localizes sudden changes within the signal.
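
The spreading of an impulse over the whole spectrum can be checked numerically. The following is a minimal sketch, not part of the original work: it uses a direct finite-length DFT as a stand-in for (1.1), and the frame length `N`, the impulse position, and the helper name `dft` are illustrative assumptions.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform of a finite sequence."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

N = 64  # illustrative frame length (assumption)

# A single impulse: perfectly localized in time.
impulse = [0.0] * N
impulse[10] = 1.0
impulse_mag = [abs(c) for c in dft(impulse)]

# A steady sinusoid with exactly 5 cycles per frame: localized in frequency.
tone = [math.cos(2 * math.pi * 5 * n / N) for n in range(N)]
tone_mag = [abs(c) for c in dft(tone)]

print("impulse: min/max magnitude =", min(impulse_mag), max(impulse_mag))
print("tone: bins above 1e-6 =", [k for k, m in enumerate(tone_mag) if m > 1e-6])
```

The impulse produces the same magnitude in every bin, so nothing in the magnitude spectrum indicates where in time the event occurred, whereas the steady tone occupies only two bins.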

Another disadvantage of the Fourier transform is that the frequency
resolution stays constant when it is computed on a digital computer. For
example, suppose that there is a band-limited signal whose sampling rate is
20 kHz, and that a 1024-point fast Fourier transform (FFT) is performed on
the signal. The output of the FFT shows the frequency spectrum of the input
signal in the digital domain. Here, the frequency resolution between adjacent
points is approximately 10 kHz / 512 samples = 19.53 Hz / sample.
A fine resolution is usually preferred to monitor low frequency characteristics,
while a relatively coarse resolution suffices to scan the high frequency behavior
of the input signal, since human senses work on a logarithmic scale [Zwi90]. One
way to adjust the frequency resolution in the digital domain is to choose a
different number of points for the FFT or to change the sampling rate of the
input signal; however, this will also affect the localization characteristics of
the Fourier representation. Therefore, a trade-off is needed to
determine where to put more emphasis, depending on the properties of the
application.
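
The arithmetic in this example, and the trade-off it implies, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the FFT sizes other than 1024 are assumptions added for comparison.

```python
def bin_spacing_hz(sample_rate_hz, n_fft):
    """Spacing between adjacent FFT bins: the same value at every frequency."""
    return sample_rate_hz / n_fft

def frame_duration_s(sample_rate_hz, n_fft):
    """Length of the time frame that one FFT has to observe."""
    return n_fft / sample_rate_hz

SAMPLE_RATE_HZ = 20_000  # sampling rate used in the text

for n_fft in (256, 1024, 4096):  # 1024 is the case from the text; others are illustrative
    spacing = bin_spacing_hz(SAMPLE_RATE_HZ, n_fft)
    duration_ms = 1000 * frame_duration_s(SAMPLE_RATE_HZ, n_fft)
    print(f"N = {n_fft:5d}: {spacing:8.2f} Hz per bin, {duration_ms:7.1f} ms per frame")
```

The product of bin spacing and frame duration is always 1, so a finer frequency resolution can only be bought with a longer frame, that is, with poorer localization in time.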

1.2  Audio  Compression 
Audio is one of the most intuitive means of communication among
human beings. As today's computers become more capable and more
affordable, good quality audio on computers naturally becomes a
requirement. In addition, the rapid deployment of multimedia computers has
established high quality audio on personal computers as a necessity.

With the linear Pulse Code Modulation (PCM) algorithm, the
reproduction of high quality audio, i.e. CD-quality audio, requires
705.6 kbits/sec of bandwidth for a single channel, totaling 1.41 Mbits/sec for
stereo sound. Today's CD-ROM drives are fast enough to provide the
specified amount of data for good quality sound directly to local stand-alone
computers. Considering that realistic multimedia stations should send and
receive the necessary information via networks, however, this bandwidth
requirement becomes impractical even on the fast networks available.
Moreover, the number of users and the amount of information communicated
among them have grown exponentially in recent years, and this growth is
expected to continue. Therefore, some form of compression is necessary for
smooth sound reconstruction while reserving bandwidth for other
functionality.
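
The bandwidth figures above follow from simple arithmetic. The sketch below assumes the standard CD parameters of 44.1 kHz sampling and 16 bits per sample, which the text does not state explicitly; the function name is hypothetical.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Raw bit rate of uncompressed linear PCM, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

# Assumed CD-audio parameters: 44.1 kHz sampling, 16-bit samples.
mono_bps = pcm_bit_rate(44_100, 16, channels=1)
stereo_bps = pcm_bit_rate(44_100, 16, channels=2)

print(f"mono:   {mono_bps / 1e3:.1f} kbit/s")
print(f"stereo: {stereo_bps / 1e6:.2f} Mbit/s")
```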

1.3  Objectives 

Even though the emergence of fiber optics has eased network
congestion issues, efficient compression of audio and video has become a
necessity to accommodate the ever increasing bandwidth requirements of the
growing number of networked stations as well as stand-alone personal computers.
Developing a decent audio coding algorithm usually requires a group of
experts for a considerable period. Several commercial companies claim that
they have good audio compression algorithms but disclose only specific
technical details. In this study we propose a codebook based
broadband audio coding scheme which is somewhat different from the
algorithms available such as the MPEG [Bra94, Sin93]. The inherent
limitations of the Fourier transform also lead us to look for an alternative with
desirable properties.

Due to limited resources and time, only portions of the proposed
scheme are implemented, but it still exhibits great potential in limited
cases. More time and resources are needed to complete the whole
implementation of the proposed scheme, and expectations are high
given the satisfactory preliminary results.

1.4 Overview

The proposed algorithm is based on many ideas from different areas.
Brief evolutionary histories of the related topics are discussed in Chapter II,
Literature Survey, where other specific audio coding methods are compared
with our approach. Chapter III contains the motivations for this study. The
evolution of the Fourier transform is discussed with respect to time-frequency
resolution, and the result is compared with that of the wavelet transform. It also
describes an efficient and effective method of implementing wavelet
transforms, called the multiresolution representation, with various base
wavelets. Since psychoacoustics plays an important role in modern audio
coders, a brief introduction to psychoacoustics is also given. The codebook
approach in speech coding algorithms is discussed, along with its evolution from
the early linear predictive coders at low bit rates. Finally, Chapter III presents
the proposal for the overall scheme based on the principles explained
earlier.

Chapter IV and Chapter V explain the approach and simulations taken
in this study, and their results on selected input signals, respectively. Chapter
VI lists the proposed future refinements for improved performance.

CHAPTER II
LITERATURE SURVEY

2.1 Base Wavelets

Wavelet theory has been developed as a unifying framework only
recently, even though similar ideas and constructions appeared as early as
the 1910s [Haa10]. The name "wavelet" had been used before in the literature,
but its current meaning is due to J. Goupillaud, J. Morlet, and A. Grossman
[Gou84, Gro84]. In the context of geophysical signal processing they
investigated an alternative to local Fourier analysis based on a single
prototype function and its scales and shifts. The modulation by complex
exponentials in the Fourier transform is replaced by a scaling operation, and
the notion of scale replaces that of frequency. The simplicity and elegance of
the wavelet scheme were appealing, and mathematicians started studying
wavelet analysis as an alternative to Fourier analysis. This led to the
discovery of wavelets which form orthonormal bases for square-integrable and
other function spaces by Meyer, Daubechies, Battle, Lemarie, and others.

Meyer [Mey86] constructed $h(x)$ with fast decay such that the wavelets
generated from $h$ constitute an orthonormal basis of $L^2(\mathbb{R})$. The Meyer
basis is a much more powerful tool than the Haar basis since it is a polynomial
function with a faster decay. Another orthonormal basis of wavelets was
constructed by Lemarie and Meyer [Lem86] and Battle [Bat87, Bat88]. In their
construction the function $h(x)$ has exponential decay, which is faster than any
power in Meyer's basis. The base wavelets developed so far decay quickly to
zero, but never reach zero, except for the Haar basis. In the late 1980s,
Daubechies developed ways to compute base wavelets which actually become zero
outside their supporting interval [Dau88]. This new class of base wavelets with
compact support provides many additional properties beyond those of the
conventional ones.

A formalization of such constructions by Mallat and Meyer created a
framework for wavelet expansions called multiresolution analysis [Mal89a,
Mal89c] and established links with methods used in other fields. Also, the
wavelet construction by Daubechies is closely connected to filter bank
methods used in digital signal processing [Mal89b, Sin93].

2.2 Psychoacoustics

The  human  auditory  system  is  often  modeled  as  a  filter  bank,  which  is 
based  on  critical  bands  [Sch70].  The  key  features  of  such  a  spectral  view  of 
hearing  are  a  logarithmic  bandwidth  behavior  of  the  filter  and  the  masking 
properties  of  dominant  sounds  over  weaker  ones  within  a  critical  band  and 
over nearby bands. The critical bands can be seen as pieces of the spectrum
that are considered as an entity in the auditory process. While the masking
properties are very complex and only partly understood, the basic concepts
can be successfully used in an audio compression system such as the MPEG
[Bra94].

Human  ears  perceive  sounds  somewhat  differently  from  most 
measuring  equipment.  As  an  example,  a  white  noise  with  a  fixed  sound 
pressure  level  (SPL)  is  usually  perceived  as  more  annoying  than  a  single 
frequency  noise  with  the  same  SPL.  Such  differences  raise  a  requirement  for  a 
new  unit  which  can  represent  the  relative  sensations  effectively,  and 
therefore,  the  specific  loudness  along  the  critical  bands  is  defined.  A  sound  is 
then  described  with  the  specific  loudness  for  each  critical  band,  and  such 
representation  contains  sufficient  information  to  reproduce  the  original 
sound. 

A perceptual coder attempts to keep quantization noise just below the
level where it would be noticeable [Jay92]. Permissible quantization noise
levels have to be calculated according to psychoacoustical principles, and
the number of bits allocated is determined accordingly.

2.3  Optimization 
There are many known algorithms which can find minimum points of
given cost functions. One of the most popular methods is steepest descent.
At each iteration the next solution point moves in the direction opposite to
the gradient at the current solution,

$$x_{k+1} = x_k - \mu_k \nabla_k, \tag{2.1}$$

where the step size $\mu_k$ is determined appropriately. When the performance
surface is convex, this method is guaranteed to give the global minimum after a
number of iterations. However, it is also well known that its convergence
becomes very slow, especially near the solution. Another well
established algorithm is the Newton method. It moves the current solution
point in the direction of the actual solution rather than that of the
current gradient. Accordingly, the next solution point is determined by

$$x_{k+1} = x_k - \mu_k H^{-1} \nabla_k, \tag{2.2}$$

where $H$ is the Hessian matrix. This method converges very quickly
toward the actual solution; however, it requires a considerable amount
of computation to obtain the Hessian matrix and its inverse at every
iteration. In many practical systems the analytical Hessian matrices are quite
often unavailable. Even if they are available in some limited cases, the
computations are often too costly.
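
The difference in convergence speed between (2.1) and (2.2) can be seen on a small example. The following is a minimal sketch under stated assumptions: the cost function is a hypothetical two-dimensional quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ (so the Hessian is simply $A$), the step sizes are fixed, and all names and numerical values are illustrative rather than taken from the dissertation.

```python
import math

def gradient(A, b, x):
    """Gradient of f(x) = 0.5 x^T A x - b^T x for a symmetric 2x2 matrix A."""
    return [A[0][0] * x[0] + A[0][1] * x[1] - b[0],
            A[1][0] * x[0] + A[1][1] * x[1] - b[1]]

def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def steepest_descent(A, b, x, mu, tol=1e-8, max_iter=10_000):
    """Iterate x <- x - mu * grad, as in (2.1), with a fixed step size mu."""
    for k in range(max_iter):
        g = gradient(A, b, x)
        if math.hypot(g[0], g[1]) < tol:
            return x, k
        x = [x[0] - mu * g[0], x[1] - mu * g[1]]
    return x, max_iter

def newton(A, b, x, mu=1.0, tol=1e-8, max_iter=100):
    """Iterate x <- x - mu * H^{-1} grad, as in (2.2); here H = A is constant."""
    H_inv = inverse_2x2(A)
    for k in range(max_iter):
        g = gradient(A, b, x)
        if math.hypot(g[0], g[1]) < tol:
            return x, k
        step = [H_inv[0][0] * g[0] + H_inv[0][1] * g[1],
                H_inv[1][0] * g[0] + H_inv[1][1] * g[1]]
        x = [x[0] - mu * step[0], x[1] - mu * step[1]]
    return x, max_iter

# Hypothetical ill-conditioned quadratic (eigenvalues of A differ by a factor of about 18).
A = [[10.0, 2.0], [2.0, 1.0]]
b = [1.0, 1.0]

x_sd, n_sd = steepest_descent(A, b, [0.0, 0.0], mu=0.1)
x_nt, n_nt = newton(A, b, [0.0, 0.0])
print("steepest descent:", x_sd, "after", n_sd, "iterations")
print("Newton:          ", x_nt, "after", n_nt, "iterations")
```

On a quadratic surface the Newton step with $\mu_k = 1$ lands on the minimum in a single iteration, while steepest descent needs hundreds of small steps along the shallow direction of the surface. The price is the matrix inverse, which is trivial here but costly for large problems.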

There are several well known quasi-Newton methods which ease the
computing requirement of the Newton method considerably. They construct
the inverse Hessian, or an approximation of it, using information gathered as
the descent process progresses. The earliest scheme was proposed by Davidon,
Fletcher and Powell [Dav59, Fle63, Fle80], and another algorithm, called the
BFGS method [Bro70, Fle70, Gol70, Sha70], was developed by Broyden,
Fletcher, Goldfarb, and Shanno. Usually these methods achieve better
overall speed, since the approximation of the inverse Hessian
matrix converges very closely to its analytical value after a relatively small
number of iterations.
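
How the inverse Hessian estimate is built up from gradient differences can be sketched with the standard BFGS update. This is an illustration only, under the following assumptions: the cost is the same kind of hypothetical two-dimensional quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$, the line search is exact (which is only possible in closed form because the cost is quadratic), and every name and value below is chosen for the example.

```python
def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def bfgs_quadratic(A, b, x, tol=1e-10, max_iter=20):
    """Minimise 0.5 x^T A x - b^T x while building an inverse-Hessian estimate H."""
    H = [[1.0, 0.0], [0.0, 1.0]]                 # initial guess: identity
    g = [p - q for p, q in zip(mat_vec(A, x), b)]
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        d = [-c for c in mat_vec(H, g)]          # quasi-Newton search direction
        mu = -dot(g, d) / dot(d, mat_vec(A, d))  # exact line search (quadratic cost only)
        s = [mu * d[0], mu * d[1]]               # step actually taken
        x = [x[0] + s[0], x[1] + s[1]]
        g_new = [p - q for p, q in zip(mat_vec(A, x), b)]
        y = [g_new[0] - g[0], g_new[1] - g[1]]   # change in gradient
        rho = 1.0 / dot(y, s)
        Hy = mat_vec(H, y)                       # H is symmetric, so y^T H = (H y)^T
        yHy = dot(y, Hy)
        # BFGS update: H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T
        H = [[H[i][j]
              - rho * (s[i] * Hy[j] + Hy[i] * s[j])
              + (rho * rho * yHy + rho) * s[i] * s[j]
              for j in range(2)] for i in range(2)]
        g = g_new
    return x, H

A = [[10.0, 2.0], [2.0, 1.0]]   # hypothetical Hessian of the quadratic cost
b = [1.0, 1.0]
x_min, H_est = bfgs_quadratic(A, b, [0.0, 0.0])
print("minimum:", x_min)
print("inverse Hessian estimate:", H_est)
```

No Hessian is ever inverted; the estimate is refined from the steps and gradient changes alone. For this two-dimensional quadratic the estimate matches the true inverse of $A$ after two updates, which is the behavior the paragraph above describes.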

2.4 Broadband Audio Coding

For the last several decades the US Department of Defense (DoD) has
taken the lead in the standardization efforts for low-bit-rate toll-quality
speech coding algorithms, particularly for use in the DoD Secure Telephone
Units (STU). Linear predictive coding (LPC) has been the algorithm of choice
for low bit rate speech coders, and many LPC-based algorithms including
LPC-10 have been developed and deployed successfully. Recently, the DoD has
chosen Codebook Excited Linear Prediction (CELP) for its next
generation STU at 4.8 kbps [Kem89]. The robust and solid performance of
CELP has made it the natural choice for the digital cellular phone standard in
the US and other countries. The superior performance of CELP at such a low
bit rate partly stems from the unique method of the "codebook approach."

On the other hand, the development of broadband audio coding
algorithms has been the domain of private industry. An exhaustive search of
the audio coding and compression literature reveals that a few large
companies have their own proprietary audio coding algorithms, whose
technical details are usually not fully disclosed in an attempt to protect the
huge investment of manpower and resources. The first open standardization
effort for audio coding algorithms was initiated by the Moving Picture Experts
Group (MPEG), a working group of ISO/IEC. The emerging need to store and
transmit information with high fidelity sound has led to the adoption of the
proposed MPEG compression algorithm as a primary audio/video coding
standard within the computer industry as well as in the entertainment
industry.

The MPEG audio compression [Bra94] is based on the coding algorithm
called the Masking-pattern Universal Subband Integrated Coding and
Multiplexing (MUSICAM) [Deh91, Sto90]. This system uses a 32-band
uniform filter bank, obtained by modulating a 512-tap prototype lowpass
filter. One reason for choosing such a filter bank is that it has a reasonable
computational complexity, since it can be implemented with a polyphase filter
followed by a fast transform. In parallel to the filter bank, a fast Fourier
transform is used for spectral estimation. Based on the power spectrum, a
masking curve is calculated, and quantization noise is then allocated to the
various subbands according to the masking function. This allocation is
performed on a small block of subband samples (typically 12). Then, the scale
factor and the quantization step are calculated for each block. They are
transmitted as side information together with the quantized samples [Hyu93].
Even though MPEG audio provides acceptable quality, the filter bank has
a fixed size frequency window, which does not match the actual perception of
audio by humans.
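
The idea of a per-block scale factor sent as side information can be illustrated with a generic block quantizer. The sketch below is not the MPEG bit allocation or its scale factor tables; it is a plain uniform quantizer with hypothetical function names, a hypothetical block of 12 sample values, and an arbitrarily chosen word length.

```python
def quantize_block(block, n_bits):
    """Quantise one block of subband samples with a shared scale factor.

    Returns the scale factor (side information) and the integer codes.
    Generic uniform mid-tread quantiser; requires n_bits >= 2.
    """
    scale = max(abs(s) for s in block) or 1.0    # avoid dividing by zero for silence
    levels = 2 ** (n_bits - 1) - 1               # codes lie in [-levels, +levels]
    codes = [round(s / scale * levels) for s in block]
    return scale, codes

def dequantize_block(scale, codes, n_bits):
    """Rebuild approximate samples from the scale factor and the codes."""
    levels = 2 ** (n_bits - 1) - 1
    return [c / levels * scale for c in codes]

# Hypothetical block of 12 subband samples.
block = [0.02, -0.11, 0.25, 0.31, -0.07, -0.29, 0.18, 0.05, -0.22, 0.30, -0.01, 0.12]
n_bits = 4                                       # word length chosen for the example

scale, codes = quantize_block(block, n_bits)
decoded = dequantize_block(scale, codes, n_bits)
max_error = max(abs(a - d) for a, d in zip(block, decoded))
print("scale factor:", scale)
print("codes:", codes)
print("largest reconstruction error:", max_error)
```

The reconstruction error is bounded by half a quantization step, so lowering `n_bits` in a subband raises the noise floor there. A perceptual coder would pick the word length per subband so that this noise stays under the masking curve.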

Many researchers have been working on the development of efficient
high quality audio compression, and the rapidly increasing number of related
publications shows different approaches in recent years. An alternative
approach to the problem is to use wavelet transforms (WT), which are known to
be superior to Fourier transforms in these applications. A pioneering effort in
this regard is the work of Tewfik, Sinha and Ali to develop an audio coding
algorithm based on wavelet transforms with Daubechies base wavelets [Sin93,
Tew93, Ali93]. The algorithm described by Sinha and Tewfik [Sin93] first
performs a non-symmetric wavelet transform on the audio input signal, and the
resulting coefficients are subjected to dynamic psychoacoustical quantization.
The quantization routine selectively ranks the audio components the human ear
can perceive, based on psychoacoustical masking effects, and allocates an
appropriate number of bits to reflect their relative importance. This
dynamic bit allocation helps reduce the bandwidth usage even further.

An  extensive  comparative  study  of  several  available  audio  coders 
[Bra94,  Deh91,  Sin93,  Sto90,  Tew93]  reveals  a  few  common  features.  They 
perform  a  filter  bank  analysis  based  on  their  base  functions  (either  single 
sinusoid  or  wavelet),  followed  by  a  quantization  of  the  transformed 
coefficients  for  each  subband.  Throughout  the  procedure  the  fundamental 
concepts  from  the  psychoacoustical  modeling  help  to  eliminate  many 
inaudible  frequency  components  and  identify  and  exploit  the  imperceptible 
quantization  noises. 

The  proposed  system  in  this  work  takes  a  somewhat  different  approach 
in  representing  the  wavelet  transformed  coefficients.  It  computes  the  energy 
distribution  of  current  audio  frames  and  matches  them  with  a  combination  of 
the predetermined codewords. Basically, the subband signals are replaced
with the carefully designed codewords and their combinations. The
psychoacoustical ideas are adopted in the procedure, but their interpretation
is quite different. The details are described in the subsequent chapters.

Instead of proposing a full coding algorithm, other researchers have
suggested various ideas for improving the performance of known methods.
Philippe and others [Phi95] and Kudumakis and Sandler [Kud95] proposed
algorithms which provide a mechanism for choosing base wavelets
that yield improved overall performance. Boland and Deriche [Bol95]
developed a hybrid algorithm with the multipulse LPC method. Also, Cheung
and Lim [Che95], and Vargas [Var93] worked on the Extended Lapped
Transform (ELT) based hybrid method in order to improve performance.

However, an exhaustive search of the literature on audio compression
and WT indicates that there are very few full audio coding algorithms
proposed based on the wavelet transforms, apart from the work reported by
Sinha and Tewfik [Sin93] and Tewfik and Ali [Tew93], where encouraging
results were obtained from a direct psychoacoustically sensitive quantization
of the WT coefficients.

In this research we seek to exploit the desirable properties of WT for
audio coding and compression and to combine these with the codebook approach
to compressing and removing redundancies, which proved to be efficient for
toll-quality voice (in the CELP algorithm). In this process we have knowingly
undertaken a mammoth task, since many tens of man-years were put into
developing appropriate codebooks for toll-quality speech, and thus we know
much work will need to be done in optimizing the codebooks for high fidelity
audio. Nonetheless, the method presented here represents a first step in this
direction and provides a demonstration of mathematical and practical
feasibility of the approach as well as a platform for further refinements of the
codebooks.


CHAPTER  III 
MOTIVATION 


This research is based on many ideas borrowed from the theories of
other areas, which are then applied and, in some cases, modified for our needs.
Some ideas originate from close comparisons of the mathematical
properties of Fourier transforms and wavelet transforms, and the
codebook approach was motivated by the successful implementation and
deployment of a similar speech coding algorithm. Also, ideas related to
psychoacoustics provide an essential foundation for the research.

This  chapter  explains  the  necessary  portions  of  the  ideas  which 
constitute  the  background  for  the  work.  They  often  help  determine  the 
direction  of  the  research  when  it  needs  guidance. 

3.1 Fourier Transform and its Time-Frequency Localization

3.1.1  Variations  in  Fourier  Transforms 

The  subject  of  Fourier  representation  is  one  of  the  oldest  subjects  in 
mathematical  analysis  and  is  of  great  importance  to  mathematicians  and 
engineers  alike.  The  importance  of  the  Fourier  transform  stems  not  only  from 
the  significance  of  its  physical  interpretations,  such  as  frequency  analysis  of 
signals, but also from the fact that Fourier analytic techniques are extremely
powerful, especially when the steady-state characteristics of a signal are
required.

The Fourier transform $\hat{f}$ of a function $f(t)$ is defined as

$$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt. \tag{3.1}$$

From a practical point of view, a Fourier transform is the Fourier integral of
some function $f$ defined on the real line $\mathbb{R}$. When $f$ is thought of as an
analog signal, its domain $\mathbb{R}$ is called the continuous time domain. In this
case, the Fourier transform $\hat{f}$ of $f$ describes the spectral behavior of the
signal $f$. Since the spectral information is given in terms of frequency, the
domain of definition of the Fourier transform $\hat{f}$, which is again $\mathbb{R}$, is
called the frequency domain.

Since the limits of the integral in (3.1) are infinite, full knowledge
of the signal in the time domain must be available to study the spectral
information of an analog signal from its Fourier transform. In addition, if a
signal is altered in a small neighborhood of some time instant, then the entire
spectrum is affected, which also means that the frequency response at a
specific time instant cannot be obtained precisely. Indeed, in the extreme
case, the Fourier transform of the delta function $\delta(t - t_0)$, with support at
a single point $t_0$, is $e^{-it_0\omega}$, which certainly covers the whole frequency
domain. Hence, in many applications such as analysis of non-stationary signals
and real-time signal processing in which transient frequency characteristics of
the input signal are required, the Fourier transform formula alone is quite
inadequate.

Gabor observed the deficiency of the Fourier transform formula in
time-frequency analysis and introduced a time-localization window function
$g(t-b)$, where the parameter $b$ is used to move the time window in order to
cover the whole time domain, for extracting local information of the Fourier
transform of the signal [Gab46]. In fact, Gabor used a Gaussian function for
the window function $g$. Since the Fourier transform of a Gaussian function is
again a Gaussian, the window is localized in frequency as well as in time
when the window function is a localized Gaussian function.

Over the last several decades the idea was further refined by other
researchers and termed the "Short-Time Fourier Transform (STFT)." The
Gabor transform and the STFT have many properties that could be
compared. In this section, however, the size and shape of the
resolution windows of each transform are computed in the time-frequency
domain and then compared to characterize each transform in terms
of resolution.

3.1.2  Gabor  Transform 

Formula (3.1) alone is not very useful for extracting information about
the spectrum from local observation of the signal $f$, since the interval of
integration is $(-\infty, \infty)$. There should be an algorithmic method to monitor
rapid changes in the signal. The optimal window for time-localization is
achieved by adopting any Gaussian function

$$g_\alpha(t) = \frac{1}{2\sqrt{\pi\alpha}}\, e^{-t^2/(4\alpha)}, \tag{3.2}$$

where $\alpha > 0$. For any fixed value of $\alpha > 0$, the Gabor transform of an
$f \in L^2(\mathbb{R})$ is defined by

$$(\mathcal{G}_b^\alpha f)(\omega) = \int_{-\infty}^{\infty} \left(e^{-i\omega t} f(t)\right) g_\alpha(t-b)\, dt, \tag{3.3}$$

that is, $(\mathcal{G}_b^\alpha f)(\omega)$ localizes the Fourier transform of $f$ around $t = b$,
and the width of the window is determined by the constant $\alpha$. If we compute
the integral of the window in (3.3) with $\omega = 0$ and $\alpha = 1/4$, we have

$$\int_{-\infty}^{\infty} g_\alpha(t-b)\, db = \int_{-\infty}^{\infty} g_\alpha(x)\, dx = 1. \tag{3.4}$$

Therefore,

$$\int_{-\infty}^{\infty} (\mathcal{G}_b^\alpha f)(\omega)\, db = \hat{f}(\omega), \qquad \omega \in \mathbb{R}. \tag{3.5}$$

Consequently, the set $\{\mathcal{G}_b^\alpha f : b \in \mathbb{R}\}$ of Gabor transforms of $f$
decomposes the Fourier transform $\hat{f}$ of $f$ exactly, to give its local spectral
information.

Our main objective in this section is to compute the size and shape of
the resolution window of the Gabor transform in the time-frequency domain.
The width of the time window is considered first. In order to choose an
appropriate width of the window function, the notion of standard deviation, or
root mean square (RMS) duration, is employed, which is defined by

$$\Delta_{g_\alpha} = \frac{1}{\|g_\alpha\|_2} \left\{ \int_{-\infty}^{\infty} x^2 g_\alpha^2(x)\, dx \right\}^{1/2}. \tag{3.6}$$

Note that since $g_\alpha$ is an even function, its center is 0, and hence $\Delta_{g_\alpha}$
agrees with the general notion of radius. In particular, the width of the window
function $g_\alpha$ is $2\Delta_{g_\alpha}$. For each $\alpha > 0$, $\Delta_{g_\alpha} = \sqrt{\alpha}$. Therefore, the
width of the window function $g_\alpha$ is $2\sqrt{\alpha}$ [Chu92].

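Another interpretation of the same Gabor transform $\mathcal{G}_b^\alpha f$ in (3.3) is
required to find the width of the frequency window. By setting

$$G_{b,\omega}^\alpha(t) = e^{i\omega t} g_\alpha(t-b), \tag{3.7}$$

we have

$$(\mathcal{G}_b^\alpha f)(\omega) = \langle f, G_{b,\omega}^\alpha \rangle = \int_{-\infty}^{\infty} f(t)\, \overline{G_{b,\omega}^\alpha(t)}\, dt. \tag{3.8}$$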


In other words, instead of considering $\mathcal{G}_b^\alpha f$ as localization of the Fourier
transform $\hat{f}$, (3.8) may be interpreted as windowing the function $f$ by using the
window function $G_{b,\omega}^\alpha$ in (3.7). This interpretation also provides a convenient
comparison of the Gabor transform with the integral wavelet transform in
Section 3.2. One advantage of (3.8) is that the Parseval Identity can be applied
to relate the Gabor transform of $f$ with the Gabor transform of $\hat{f}$. In fact, the
following can be derived

$$(\mathcal{G}_b^\alpha f)(\omega) = \frac{1}{2\pi} \langle \hat{f}, \hat{G}_{b,\omega}^\alpha \rangle = \frac{e^{-ib\omega}}{2\sqrt{\pi\alpha}} \int_{-\infty}^{\infty} \left(e^{ib\eta} \hat{f}(\eta)\right) g_{1/(4\alpha)}(\eta - \omega)\, d\eta, \tag{3.9}$$

since

$$\hat{G}_{b,\omega}^\alpha(\eta) = e^{-ib(\eta-\omega)}\, e^{-\alpha(\eta-\omega)^2}. \tag{3.10}$$

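If we rewrite (3.9), it gives the same result, but from a different viewpoint:

$$\int_{-\infty}^{\infty} \left(e^{-i\omega t} f(t)\right) g_\alpha(t-b)\,dt = \left(\sqrt{\frac{\pi}{\alpha}}\, e^{-ib\omega}\right) \frac{1}{2\pi} \int_{-\infty}^{\infty} \left(e^{ib\eta}\hat{f}(\eta)\right) g_{1/(4\alpha)}(\eta-\omega)\,d\eta. \tag{3.11}$$

The interpretation of (3.9) and (3.11) shows an interesting result. The
windowed Fourier transform of $f$ with window function $g_\alpha$ at $t = b$ agrees with
the windowed inverse Fourier transform of $\hat{f}$ with window function $g_{1/(4\alpha)}$ at
$\eta = \omega$, except for the multiplicative term in front. Since $\Delta_{g_\alpha} = \sqrt{\alpha}$ from the
above, the product of the widths of these windows becomes

$$(2\Delta_{g_\alpha})(2\Delta_{g_{1/(4\alpha)}}) = 2. \tag{3.12}$$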

The Cartesian product

$$[b - \sqrt{\alpha},\; b + \sqrt{\alpha}] \times \left[\omega - \frac{1}{2\sqrt{\alpha}},\; \omega + \frac{1}{2\sqrt{\alpha}}\right] \tag{3.13}$$

of these two windows is called a rectangular time-frequency window. It is
usually represented in the time-frequency domain to show what the window
looks like. The width of the time window is $2\sqrt{\alpha}$, and the width of the
frequency window is $1/\sqrt{\alpha}$, as shown in Figure 3.1. Observe that the widths of
the time-frequency window are unchanged while the window moves along the
frequency axis and the time axis. This gives some freedom to applications
which require the spectral information in a neighborhood of some time
instant. However, it still makes the Gabor transform poorly suited to signals
in which unusually high and low frequencies are mixed together.

3.1.3 Time-Frequency Localization of STFT

The Gabor transform is simply a windowed Fourier transform with a Gaussian
function as the window function. Non-Gaussian functions may also be
adopted as the window if they provide other desirable properties such as
computational efficiency, convenience in implementation, or other
application-specific interests.

For a non-trivial function $w \in L^2(\mathbb{R})$ to qualify as a window function, it
must satisfy the requirement that

$$t\,w(t) \in L^2(\mathbb{R}). \tag{3.14}$$
Figure 3.1  The resolution window of the Gabor transform.

The Gabor transform can be generalized to any windowed Fourier transform
by using a function $w$ that satisfies (3.14) as the window function:

$$(\tilde{\mathcal{G}}_b f)(\omega) := \int_{-\infty}^{\infty} \left(e^{-i\omega t} f(t)\right) \overline{w(t-b)}\,dt. \tag{3.15}$$

Hence, by setting

$$W_{b,\omega}(t) := e^{i\omega t}\, w(t-b), \tag{3.16}$$

we have

$$(\tilde{\mathcal{G}}_b f)(\omega) = \langle f, W_{b,\omega} \rangle = \int_{-\infty}^{\infty} f(t)\,\overline{W_{b,\omega}(t)}\,dt. \tag{3.17}$$


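Equation (3.17) is, therefore, defined as the short-time Fourier transform
(STFT) when $w \in L^2(\mathbb{R})$ and both $w$ and its Fourier transform $\hat{w}$ satisfy (3.14).

In general, for any $w \in L^2(\mathbb{R})$ that satisfies (3.14), the center and radius of
$w$ are defined, respectively, by

$$x^* := \frac{1}{\|w\|_2^2} \int_{-\infty}^{\infty} t\,|w(t)|^2\,dt, \quad \text{and} \tag{3.18}$$

$$\Delta_w := \frac{1}{\|w\|_2} \left\{ \int_{-\infty}^{\infty} (t - x^*)^2\,|w(t)|^2\,dt \right\}^{1/2}. \tag{3.19}$$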


Our interest in this section is to find the time and frequency widths of the
resolution window of the STFT in the time-frequency domain. In (3.17),
$(\tilde{\mathcal{G}}_b f)(\omega)$ together with (3.18) and (3.19) gives local information of $f$ in the
time window

$$[x^* + b - \Delta_w,\; x^* + b + \Delta_w]. \tag{3.20}$$

Therefore, the value $2\Delta_w$ is used to represent the width of the time window
function $w$. Suppose that $\hat{w}$ also satisfies the condition in (3.14). Then
we can determine the center $\omega^*$ and radius $\Delta_{\hat{w}}$ of the window function $\hat{w}$,
which gives the width of the frequency window. By setting

$$V_{b,\omega}(\eta) := \frac{1}{2\pi}\hat{W}_{b,\omega}(\eta) = \frac{1}{2\pi}\, e^{ib\omega}\, e^{-ib\eta}\, \hat{w}(\eta-\omega), \tag{3.21}$$

which is also a window function with center at $\omega^* + \omega$ and radius equal to $\Delta_{\hat{w}}$,
we have, by the Parseval Identity,

$$(\tilde{\mathcal{G}}_b f)(\omega) = \langle f, W_{b,\omega} \rangle = \langle \hat{f}, V_{b,\omega} \rangle. \tag{3.22}$$

Hence, $(\tilde{\mathcal{G}}_b f)(\omega)$ also gives local spectral information of $f$ in the frequency
window

$$[\omega^* + \omega - \Delta_{\hat{w}},\; \omega^* + \omega + \Delta_{\hat{w}}]. \tag{3.23}$$

Consequently, by choosing any window $w$ such that both $w$ and $\hat{w}$ satisfy
(3.14), we have a time-frequency window

$$[x^* + b - \Delta_w,\; x^* + b + \Delta_w] \times [\omega^* + \omega - \Delta_{\hat{w}},\; \omega^* + \omega + \Delta_{\hat{w}}] \tag{3.24}$$

with the width $2\Delta_w$ in the time domain and the width $2\Delta_{\hat{w}}$ in the frequency
domain, resulting in the constant window area of $4\Delta_w\Delta_{\hat{w}}$. The width of the
time-frequency window remains unchanged for localizing signals with both
high and low frequencies. Consequently, the resolution window of the STFT in
the time-frequency domain looks similar to that of the Gabor
transform in Figure 3.1.

For accurate time-frequency localization, one chooses a window
function $w$ such that the time-frequency window has a sufficiently small area
$4\Delta_w\Delta_{\hat{w}}$. We have already seen that if $w$ is any Gaussian function, then the
window area is 2. The question is now whether a smaller area can be achieved.
By the uncertainty principle, no window has a time-frequency area smaller
than that of the Gaussian functions, that is, $4\Delta_w\Delta_{\hat{w}} \ge 2$
[Chu92, Dau88, Rio91]. Therefore, the Gabor transform is the STFT with
the smallest time-frequency window. Just like the Gabor transform, the
STFT is not suitable for analyzing signals with both very high and very low
frequency components.
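As a rough numerical illustration of the window area $4\Delta_w\Delta_{\hat{w}}$, the sketch below compares a Gaussian window with a raised-cosine window. Both windows are illustrative choices, not taken from the text. To avoid computing $\hat{w}$ explicitly, the frequency radius is obtained from the Parseval relation $\int \omega^2 |\hat{w}|^2\,d\omega / \int |\hat{w}|^2\,d\omega = \int |w'|^2\,dt / \int |w|^2\,dt$, which holds for real, even windows whose spectrum is centered at 0; this shortcut is an assumption of the sketch.

```python
import numpy as np

def window_area(t, w):
    """Return 4 * Delta_w * Delta_w_hat for a real, even window w sampled on t."""
    dt = t[1] - t[0]
    energy = np.sum(w**2) * dt
    # Time radius: the center is 0 because w is even.
    delta_t = np.sqrt(np.sum(t**2 * w**2) * dt / energy)
    # Frequency radius via Parseval, using a finite-difference derivative of w.
    dw = np.gradient(w, dt)
    delta_omega = np.sqrt(np.sum(dw**2) * dt / energy)
    return 4.0 * delta_t * delta_omega

t = np.linspace(-20.0, 20.0, 400001)
gauss = np.exp(-t**2 / 4.0)   # Gaussian shape; scaling does not affect the area
raised_cos = np.where(np.abs(t) <= 1.0, np.cos(np.pi * t / 2.0)**2, 0.0)

print("Gaussian area:     ", window_area(t, gauss))
print("Raised-cosine area:", window_area(t, raised_cos))
```

The Gaussian area comes out at the lower bound of 2, while the raised-cosine window gives a slightly larger value.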

3.1.4 Comparisons with Actual Data-An Example

In the following example, several STFTs with different time resolutions are
applied to a signal. Since the area of the resolution window of the STFT stays
constant at $4\Delta_w\Delta_{\hat{w}}$, the frequency resolution varies accordingly. The
spectrogram of each result is then plotted in the time-frequency domain.

Suppose a signal $f(n)$ is given as

$$f(n) = \sin(2\pi f_1 nT) + \sin(2\pi f_2 nT) + 2.5\,\bigl(\delta(n - m_1) + \delta(n - m_2)\bigr), \tag{3.25}$$

where $T = 1/8000$, $f_1 = 500$ Hz, $f_2 = 1500$ Hz, $m_1 = 300$, and $m_2 = 370$. The
impulse $\delta(n - m)$ is defined as

$$\delta(n - m) = \begin{cases} 1 & \text{when } n = m, \\ 0 & \text{otherwise.} \end{cases} \tag{3.26}$$

The original signal for the first 400 samples is plotted in Figure 3.2 with clear
impulses at the corresponding locations. The time difference between the two
impulses is just 70 samples.

Figure 3.2  The first 400 samples of the original signal. There are two
distinct impulses at 300 and 370.

Figure 3.3 shows the spectrogram when the length of the time window
is 32 samples. The horizontal axis represents the frame number; therefore,
the locations of the impulses can be observed rather precisely
because the window length is relatively small. The vertical axis is the
digital-domain representation of the analog frequency, and the maximum
frequency is 4000 Hz according to Shannon's sampling theorem. Since the
width of the time window is much smaller than the time gap between the
impulses, the picture clearly shows the rapid changes in the signal, two
impulses in this case, in the time domain. However, the steady-state
frequency components at 500 Hz and 1500 Hz appear rather unclear due to
the short time window.

Figure 3.3  The spectrogram of the original signal with frames of 32 samples.
A frame consists of 32 samples without any overlap. We can
clearly see the locations of the rapid changes in the signal.
However, the steady-state frequency components of 500 Hz and
1500 Hz cannot be determined distinctly.

Figure 3.4 shows the spectrogram when the time window length
becomes 64 samples. Likewise, its horizontal axis represents the frame
number, and its vertical axis shows the digital frequency. In this case, the
width of the time window is close to the time difference between the impulses.
Therefore, the distinction is not as clear as in Figure 3.3. However, the
separation of the steady-state frequency components of 500 Hz and 1500 Hz
gets clearer.

Figure 3.4  The spectrogram of the original signal with frames of 64 samples.
A frame is composed of 64 samples without overlap. The changes
in the time domain are no longer distinct. There is a noticeable
change in the frequency domain. Now, we can roughly guess the
steady-state frequency components.

Figure  3.5  has  the  spectrogram  of  the  signal  f{n)  with  the  window 
length  of  128  samples.  Its  horizontal  and  vertical  axes  represent  the  number 
of  frames  and  the  digital  representation  of  the  frequency,  respectively. 
Comparing  Figure  3.5  with  Figure  3.3  and  Figure  3.4,  the  two  steady-state 
frequency  components  are  clearly  visible  in  Figure  3.5  within  a  fairly 
reasonable  resolution.  However,  its  time  resolution  is  far  worse  than  that  of 
Figure  3.3;  hence,  the  locations  of  the  two  impulses  are  not  clear  at  all.  It  is 
practically  impossible  to  tell  the  number  of  impulses  in  the  input  signal. 

As Figure 3.3, Figure 3.4, and Figure 3.5 show, there is a trade-off
between the time resolution and the frequency resolution, and the right
balance depends on the specific requirements of the application.
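The experiment can be reproduced along the following lines. This is a minimal sketch, not the original simulation code: it assumes rectangular (unweighted) non-overlapping frames as described in the figure captions, and the total signal length of 1024 samples and all function names are hypothetical choices.

```python
import numpy as np

FS = 8000.0              # sampling rate in Hz, i.e. T = 1/8000
F1, F2 = 500.0, 1500.0   # steady-state tones
M1, M2 = 300, 370        # impulse locations

def test_signal(num_samples=1024):
    """f(n) from (3.25): two sinusoids plus two impulses of height 2.5."""
    n = np.arange(num_samples)
    f = np.sin(2 * np.pi * F1 * n / FS) + np.sin(2 * np.pi * F2 * n / FS)
    f[M1] += 2.5
    f[M2] += 2.5
    return f

def frame_spectrogram(f, frame_len):
    """Magnitude spectra of consecutive non-overlapping rectangular frames."""
    num_frames = len(f) // frame_len
    frames = f[:num_frames * frame_len].reshape(num_frames, frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))   # (frames, frame_len // 2 + 1)

f = test_signal()
for frame_len in (32, 64, 128):
    spec = frame_spectrogram(f, frame_len)
    print(f"{frame_len} samples/frame: spectrogram shape {spec.shape}, "
          f"bin spacing {FS / frame_len} Hz, "
          f"impulses in frames {M1 // frame_len} and {M2 // frame_len}")
```

With 32-sample frames the two impulses fall into clearly separated frames but the frequency bins are 250 Hz apart; with 128-sample frames the bin spacing shrinks to 62.5 Hz but both impulses land in the same frame, which is the trade-off seen in Figures 3.3 to 3.5.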

3.2 Wavelet Transform and its Time-Frequency Localization

3.2.1  Wavelet  Transform-Definition 

Researchers  and  mathematicians  have  come  up  with  a  new  definition, 
called  the  wavelet  transform  (WT),  by  imposing  some  additional  conditions  on 
the  STFT.  In  order  to  compare  the  major  properties  of  the  WT  with  those  of 
the  previous  ones,  the  resolution  window  in  the  time-frequency  domain  should 
be  computed. 


Figure 3.5  The spectrogram of the original signal with frames of 128 samples.
One frame is now 128 samples long. The frequency resolution
on the vertical axis clearly allows us to estimate the
frequencies of the steady-state components. However, the rapid
changes in the time domain are not clear at all.
In analyzing a signal with an STFT, the resolution window in the
time-frequency domain stays unchanged regardless of the input
signal. Unlike the variations of the Fourier transform, the WT is known to
provide a flexible time-frequency window which automatically narrows its
time window when observing high-frequency characteristics, and widens its
time window when studying low-frequency properties.

A function $\psi \in L^2(\mathbb{R})$ must satisfy the "admissibility" condition,

$$C_\psi := \int_{-\infty}^{\infty} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\,d\omega < \infty, \tag{3.27}$$

to be a "base wavelet." Relative to every base wavelet $\psi$, the integral wavelet
transform (IWT) on $L^2(\mathbb{R})$ is defined by

$$(W_\psi f)(b,a) := |a|^{-1/2} \int_{-\infty}^{\infty} f(t)\,\overline{\psi\!\left(\frac{t-b}{a}\right)}\,dt, \tag{3.28}$$

where $a, b \in \mathbb{R}$ with $a \ne 0$. If, in addition, both $\psi$ and $\hat{\psi}$ satisfy the conditions
to be a window for the STFT such as (3.14), then the base wavelet $\psi$ provides
a time-frequency window with finite area given by $4\Delta_\psi\Delta_{\hat{\psi}}$. Under this
additional assumption, it follows that $\hat{\psi}$ is a continuous function, so that the
finiteness of $C_\psi$ in (3.27) implies $\hat{\psi}(0) = 0$, or equivalently,

$$\int_{-\infty}^{\infty} \psi(t)\,dt = 0. \tag{3.29}$$

This is another property that $\psi$ should possess to become a wavelet.

3.2.2 Time-Frequency Window of Wavelet Transform

By setting

$$\psi_{b;a}(t) := |a|^{-1/2}\, \psi\!\left(\frac{t-b}{a}\right), \tag{3.30}$$

the IWT defined in (3.28) can be written as

$$(W_\psi f)(b,a) = \langle f, \psi_{b;a} \rangle. \tag{3.31}$$

If the center and radius of the window function $\psi$ are given by $t^*$ and $\Delta_\psi$,
respectively, then the function $\psi_{b;a}$ is a window function with center at $b + at^*$
and radius equal to $a\Delta_\psi$. Hence, the IWT gives local information of an analog
signal $f$ with a time window

$$[b + at^* - a\Delta_\psi,\; b + at^* + a\Delta_\psi]. \tag{3.32}$$

This window narrows for small values of $a$ and widens for large $a$.

In order to determine the size and shape of the resolution window in the
time-frequency domain, the frequency window is now required. It can be
obtained by taking the Fourier transform of (3.30),

$$\frac{1}{2\pi}\hat{\psi}_{b;a}(\omega) = \frac{|a|^{-1/2}}{2\pi} \int_{-\infty}^{\infty} e^{-i\omega t}\, \psi\!\left(\frac{t-b}{a}\right) dt = \frac{a|a|^{-1/2}}{2\pi}\, e^{-ib\omega}\, \hat{\psi}(a\omega), \tag{3.33}$$
and suppose that the center frequency and the radius of the window function
$\hat{\psi}$ are $\omega^*$ and $\Delta_{\hat{\psi}}$, respectively, and set $\eta(\omega) := \hat{\psi}(\omega + \omega^*)$. Then the window
function $\eta$ has center at the origin and radius equal to $\Delta_{\hat{\psi}}$. The application of
the Parseval Identity to (3.33) yields

$$(W_\psi f)(b,a) = \frac{a|a|^{-1/2}}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\omega)\, e^{ib\omega}\, \overline{\eta\!\left(a\left(\omega - \frac{\omega^*}{a}\right)\right)}\,d\omega. \tag{3.34}$$

Since $\eta\!\left(a\left(\omega - \frac{\omega^*}{a}\right)\right) = \eta(a\omega - \omega^*) = \hat{\psi}(a\omega)$ has radius $\frac{1}{a}\Delta_{\hat{\psi}}$, (3.34) implies
that the IWT, through $\hat{\psi}(a\omega)$, gives the frequency window

$$\left[\frac{\omega^*}{a} - \frac{1}{a}\Delta_{\hat{\psi}},\; \frac{\omega^*}{a} + \frac{1}{a}\Delta_{\hat{\psi}}\right]. \tag{3.35}$$

The center frequency and the bandwidth of the frequency window are easily
computed from (3.35) as $\omega^*/a$ and $2\Delta_{\hat{\psi}}/a$, respectively. A close examination of
these parameters reveals that the ratio of the two numbers stays unchanged
regardless of $a$. Therefore, when the center frequency increases, so does the
bandwidth of the filter. This is the principle of "constant-Q" filtering in
communication theory.

From the information obtained above, a rectangular time-frequency
window

$$[b + at^* - a\Delta_\psi,\; b + at^* + a\Delta_\psi] \times \left[\frac{\omega^*}{a} - \frac{1}{a}\Delta_{\hat{\psi}},\; \frac{\omega^*}{a} + \frac{1}{a}\Delta_{\hat{\psi}}\right] \tag{3.36}$$

can be computed in the time-frequency domain. From (3.36), the time window
automatically narrows for detecting high-frequency characteristics for small $a$,
and widens for investigating low-frequency behavior for large $a$. The resolution
windows in the time-frequency domain are shown in Figure 3.6 accordingly.
Since both $\psi$ and $\hat{\psi}$ satisfy the conditions to be a window for the STFT such as
(3.14), the areas of the resolution windows in the time-frequency domain remain
constant along the time and frequency axes.
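The constant-Q behavior and the constant window area can be made concrete with a few lines of arithmetic on (3.36). The sketch below is only an illustration: the values chosen for $t^*$, $\Delta_\psi$, $\omega^*$ and $\Delta_{\hat{\psi}}$ are hypothetical and do not belong to any particular wavelet, and the scale $a$ is assumed positive.

```python
def wavelet_window(a, b, t_star, delta_psi, omega_star, delta_psi_hat):
    """Time and frequency intervals of the window (3.36) for scale a > 0."""
    time_win = (b + a * t_star - a * delta_psi, b + a * t_star + a * delta_psi)
    freq_win = (omega_star / a - delta_psi_hat / a,
                omega_star / a + delta_psi_hat / a)
    return time_win, freq_win

# Hypothetical wavelet parameters, chosen only for illustration.
T_STAR, D_PSI, W_STAR, D_PSI_HAT = 0.0, 1.0, 5.0, 0.7

for a in (0.25, 0.5, 1.0, 2.0, 4.0):
    (t0, t1), (w0, w1) = wavelet_window(a, 0.0, T_STAR, D_PSI, W_STAR, D_PSI_HAT)
    center = 0.5 * (w0 + w1)
    bandwidth = w1 - w0
    area = (t1 - t0) * bandwidth
    print(f"a={a}: time width={t1 - t0:.3f}, center={center:.3f}, "
          f"bandwidth={bandwidth:.3f}, Q={center / bandwidth:.3f}, area={area:.3f}")
```

As $a$ decreases, the time width shrinks while the center frequency and bandwidth grow together, so both the ratio Q and the area $4\Delta_\psi\Delta_{\hat{\psi}}$ stay fixed.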

3.3 Multiresolution Representation

3.3.1  Mathematical  Properties 

The multiresolution representation of a signal describes the input
by its approximation at a coarser resolution together with the difference
between the original and the approximation. The approximation at a coarser
resolution can be further approximated, and the procedure can be repeated. In
order to develop the multiresolution representation of signals, a new operator,
$A_{2^j}$, is defined to denote the approximation of a signal $f(x) \in L^2(\mathbb{R})$ at a
resolution $2^j$. The mathematical properties and requirements of the
approximation operator are defined in [Mal89a].

3.3.2  Implementation  of  Multiresolution  Representation 

Obviously, there is a difference between the approximations of a signal
$f(x)$ at the resolutions $2^{j+1}$ and $2^j$. This difference is defined as the "detail
signal" at the resolution $2^j$. The approximations of a signal at the resolutions
$2^{j+1}$ and $2^j$ are equal to its orthogonal projections on the vector spaces
$V_{2^{j+1}}$ and $V_{2^j}$, respectively. Therefore, the detail signal at the resolution
$2^j$, $D_{2^j} f$, is given by the orthogonal projection of the original signal on the
orthogonal complement of $V_{2^j}$ in $V_{2^{j+1}}$. Let $W_{2^j}$ be this orthogonal
complement, i.e.,

$$W_{2^j} \cap V_{2^j} = \{0\}. \tag{3.37}$$

Figure 3.6  The resolution windows of the WT in the time-frequency
domain.

Using  the  property  of  (3.37)  the  multiresolution  representation  can  be 
efficiently  implemented  by  a  pyramidal  algorithm.  This  will  be  discussed  later 
in  this  section. 

Mallat proved in his pioneering work [Mal89a, Mal89c] that the detail
signal $D_{2^j} f$ can be computed by convolving $A^d_{2^{j+1}} f$ with the filter $G$ and keeping
every other sample of the output. The simplest building block of the
multiresolution representation is shown in Figure 3.7. The orthogonal wavelet
representation of a discrete signal $A^d_1 f$ can be computed by successively
decomposing $A^d_{2^{j+1}} f$ into $A^d_{2^j} f$ and $D_{2^j} f$ for $-J \le j \le -1$. This idea leads to the
algorithm, called the pyramidal implementation, which is shown in Figure
3.8. The block diagram shows the decomposition of an approximation $A^d_{2^{j+1}} f$
into a further approximation at a coarser resolution, $A^d_{2^j} f$, and the detail signal
$D_{2^j} f$. Therefore, a successive application of the same algorithm gives the
complete wavelet representation of the signal $f$. Hence, the original discrete
signal $A^d_1 f$ is represented by

$$\left(A^d_{2^{-J}} f,\; (D_{2^j} f)_{-J \le j \le -1}\right). \tag{3.38}$$

This set of discrete signals is called an orthogonal wavelet representation.

Figure 3.7  The basic building block of the multiresolution representation. It
effectively segments the input signal into two portions, one for
the low-frequency approximation and the other for the high-frequency
detail signal.

Figure 3.8  The block diagram of the multiresolution representation.
This can be implemented easily by repeating the
building block, resulting in the pyramidal structure.

The properties of the filters, $h(n)$ and $g(n)$, in Figure 3.7 and Figure
3.8 are derived from the definitions of the scaling function and the wavelet
function [Mal89a]. The impulse response of $G$ is related to that of $H$ by

$$g(n) = (-1)^{1-n}\, h(1-n). \tag{3.39}$$

$G$ is the mirror filter of $H$, and is a highpass filter. Filters related as in
(3.39) are known as Quadrature Mirror Filters (QMF) [Est77].

Coifman and others [Coi92, Wic89] proposed another scheme called
"wavelet packetization" based on the same pyramidal structure in Figure 3.8.
Unlike the previous method, the QMF filter pair is applied successively to the
approximations as well as to the detail signals. Hence, the resulting scheme
looks like the one in Figure 3.9. This method generates final output
coefficients of uniform length, which is especially important
when a vector quantization algorithm is needed to encode the output
coefficients. It allows the use of a single well-tuned codebook with a uniform
code length instead of several codebooks with different code lengths. Another
important feature of the method is the effective frequency segmentation. Since
the QMF building block divides the frequency spectrum into a low subband and a
high subband at the given resolution, the overall system carries out a rough
frequency segmentation into several subbands. The performance of the
frequency segmentation may vary significantly with the choice of the base
wavelet. Since our research is based on the frequency segmentation,
great care is taken when choosing the optimal base wavelet.

Figure 3.9  The block diagram of the wavelet packetization. The
length of the output subbands is fixed according to the frame
size and the depth of the structure.
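The QMF relation (3.39), the pyramidal algorithm and the wavelet packet tree can be sketched together as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the 4-tap Daubechies lowpass filter with the orthonormal normalization (coefficients summing to $\sqrt{2}$) is an illustrative choice, the input frame is extended periodically at its boundaries, and all function names are hypothetical.

```python
import numpy as np

# ASSUMPTION: 4-tap Daubechies lowpass filter, orthonormal normalization.
SQ3 = np.sqrt(3.0)
H = np.array([1 + SQ3, 3 + SQ3, 3 - SQ3, 1 - SQ3]) / (4.0 * np.sqrt(2.0))

def qmf_highpass(h):
    """Return {n: g(n)} with g(n) = (-1)**(1-n) * h(1-n), h given on n = 0..L-1."""
    return {1 - m: (-1) ** m * h[m] for m in range(len(h))}

def analysis_step(x, h):
    """Split x into approximation and detail: filter, keep every other sample.
    The frame is treated as periodic (an assumption of this sketch)."""
    size = len(x)
    g = qmf_highpass(h)
    approx = np.zeros(size // 2)
    detail = np.zeros(size // 2)
    for n in range(size // 2):
        for k, hk in enumerate(h):
            approx[n] += hk * x[(2 * n + k) % size]
        for k, gk in g.items():
            detail[n] += gk * x[(2 * n + k) % size]
    return approx, detail

def pyramid(x, h, depth):
    """Pyramidal algorithm: only the approximation is split again."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(depth):
        approx, detail = analysis_step(approx, h)
        details.append(detail)
    return approx, details

def wavelet_packet(x, h, depth):
    """Wavelet packet tree: approximations and details are both split again."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(depth):
        bands = [part for band in bands for part in analysis_step(band, h)]
    return bands

x = np.random.default_rng(0).standard_normal(64)
approx, details = pyramid(x, H, 3)
bands = wavelet_packet(x, H, 3)
print("pyramid detail lengths:", [len(d) for d in details], "approx:", len(approx))
print("packet subband lengths:", [len(b) for b in bands])
```

The pyramid yields subbands of decreasing length, whereas every wavelet packet subband has the same length, which is the property that makes a single codebook with a uniform code length possible.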

3.3.3 Compactly Supported Wavelet

Daubechies [Dau88] found a different way to construct base wavelets,
called "compactly supported wavelets." Unlike the previous ones [Mey86,
Lem86, Bat87], her base wavelet has compact support, which has
considerable mathematical advantages.

When defining the multiresolution representation in Section 3.3.2, two
sequences, $h(n)$ and $g(n)$, which are closely related by (3.39), were adopted.
Daubechies imposed several conditions on $h(n)$ and $g(n)$ based on the
properties of compact support, and derived the scaling functions $\phi$ and
wavelet functions $\psi$ in [Dau88, Dau92].

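Sequence $h(n)$ should have the following properties in order to serve as
an orthonormal basis:

$$\sum_n |h(n)|\,|n|^{\varepsilon} < \infty \quad \text{for some } \varepsilon > 0, \tag{3.40}$$

$$\sum_n h(n-2k)\,h(n-2l) = \delta_{kl}, \quad \text{and} \tag{3.41}$$

$$\sum_{n=-\infty}^{\infty} h(n) = \sqrt{2}. \tag{3.42}$$

To simplify the notation, suppose that $m_0(\omega) = 2^{-1/2} \sum_n h(n)\,e^{in\omega}$ can be written
as

$$m_0(\omega) = \left[\tfrac{1}{2}\left(1 + e^{i\omega}\right)\right]^N \left[\sum_{n=-\infty}^{\infty} f(n)\,e^{in\omega}\right], \tag{3.43}$$

where

$$\sum_{n=-\infty}^{\infty} |f(n)|\,|n|^{\varepsilon} < \infty \quad \text{for some } \varepsilon > 0, \text{ and} \tag{3.44}$$

$$\sup_{\omega \in \mathbb{R}} \left| \sum_{n=-\infty}^{\infty} f(n)\,e^{in\omega} \right| < 2^{N-1}. \tag{3.45}$$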


Considering (3.43), we may find a different representation of the
Fourier transform of the scaling function using the new definition $m_0$:

$$\hat{\phi}(\omega) = (2\pi)^{-1/2} \prod_{j=1}^{\infty} m_0(2^{-j}\omega). \tag{3.46}$$

Daubechies has proved [Dau88] that the wavelet function $\psi(x)$ can be
represented with the scaling function in (3.46) as

$$\psi(x) = \sqrt{2} \sum_{n=-\infty}^{\infty} g(n)\,\phi(2x-n), \tag{3.47}$$

which is exactly the same as that of the multiresolution representation
[Mal89a]. Therefore, the same sequences $h(n)$ and $g(n)$, which satisfy (3.39)
through (3.45), can be used to compute the multiresolution representation of
input signals. This property has practical importance since the
multiresolution representation provides an easy platform to implement the
wavelet transforms. For this reason, our simulations are based on the
Daubechies wavelets, and are implemented in the multiresolution
representation.

3.4 Psychoacoustical Modeling

In  order  to  achieve  sound  transmission  or  reproduction  that  is  not  only 
very  good  but  also  efficient,  the  redundancy  in  the  encoded  signal  should  be 
minimized.  In  our  research,  since  the  final  receiver  is  assumed  to  be  a  human 
auditory  system,  any  part  of  the  processed  signal  that  is  not  recognized  by  the 
auditory  system  provides  unnecessary  redundancy.  Therefore,  all  equipment 
in  the  encoder  and  the  decoder  has  to  be  adapted  to  the  characteristics  of  the 
human ear.

There  are  many  types  of  measuring  equipment  available  which 
visualize  different  objective  characteristics  of  sounds.  However,  human 
auditory  systems  look  for  somewhat  different  properties  of  sounds.  They  do 
not  perceive  frequency,  but  rather  perceive  pitch;  do  not  perceive  level,  but 
loudness;  do  not  perceive  spectral  shape,  modulation  depth,  or  frequency  of 
modulation,  but  instead  do  perceive  sharpness,  fluctuation  strength,  or 
roughness, respectively. Also, they do not perceive time directly, and their
perception depends on a subjective duration, often quite different from the
physical duration [Bur92, Zwi90]. In addition, masking plays an important
role in the frequency domain, as well as in the time domain. In the early 1970s,
Zwicker and Burkhard found that the information received by our auditory
system can be described most effectively in the three dimensions of specific
loudness, critical band rate, and time. The resulting three-dimensional
pattern is the measure from which the audible characteristics of the original
sound can be reconstructed [Bur92, Zwi90, Zwi91]. This is one of the basic
principles on which this research is based.

3.4.1  Transformation  of  Frequency  to  Critical  Band  Rate 
Masking is usually described by the sound pressure level that a test sound
needs in order to be barely audible in the presence of a masker. For narrow-band
noises used as maskers and pure tones used as test sounds, masking patterns
can be obtained for different center frequencies of the narrow-band noise
masker, as shown in Figure 3.10. The level of the barely audible pure tone is
plotted as a function of frequency on a linear scale in Figure 3.10a, in contrast
to the logarithmic scale in Figure 3.10b. The level of the narrow-band maskers
is 60 dB for all curves. Comparing the results produced with different center
frequencies of the masker, researchers could not find a single similarity
among the masking curves that holds over the whole frequency range. However,
the shapes of the curves appear similar for center frequencies up to
about 500 Hz on a linear frequency scale, while for center frequencies above
500 Hz there is a similarity on a logarithmic frequency scale. In order to
accommodate this finding, researchers introduced a new
measure called "critical band rate", which has a linear frequency scale up to
about 500 Hz and a logarithmic frequency scale above 500 Hz. An
anatomical analysis of the human inner ear (including the basilar membrane)
indicates that the critical band rate scale is directly related to the place along
the basilar membrane, where the sensory cells are located in a nearly
equidistant configuration. The resulting scale conforms to the way the human
auditory system perceives sound.

Figure 3.10  Narrow-band noises and pure tones are used as maskers
and test sounds, respectively. Excitation levels of narrow-band
noises of given center frequency are shown as a function
of frequency. The dotted line shows the threshold in
quiet. (a) linear scale, (b) logarithmic scale [Zwi90].

The critical band concept is based on the well proven and widely
accepted assumption that our auditory system analyzes a broad spectrum in
parts that correspond to critical bands. Adding one critical band to the next, so
that the upper limit of the lower critical band corresponds to the lower limit of
the next higher critical band, produces the scale of the critical band rate. Since
critical bands are approximately 100 Hz wide up to 500 Hz and above 500 Hz
take a relative width of 20%, it becomes clear that the critical band rate
depends on frequency [Sch70, Zwi90, Zwi91]. The broad acceptance of the
critical band concept in many models and hypotheses called for a unit for the
critical band rate. Hence, the "bark" was defined, where one bark is one
critical band wide.
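The two rules stated above (bands about 100 Hz wide up to 500 Hz, and 20% relative width above 500 Hz) are enough to build a rough frequency-to-bark mapping. The sketch below is a rule-of-thumb approximation derived only from those two statements, not the tabulated critical band scale; it assumes the 20% width is measured relative to the band's center frequency, so that the upper edge of a band is 1.1/0.9 times its lower edge. Function names are hypothetical.

```python
import math

# Upper/lower edge ratio of a band whose width is 20% of its center frequency.
RATIO = 1.1 / 0.9

def critical_band_rate(freq_hz):
    """Rule-of-thumb critical band rate in bark (linear, then logarithmic)."""
    if freq_hz <= 500.0:
        return freq_hz / 100.0
    return 5.0 + math.log(freq_hz / 500.0) / math.log(RATIO)

def band_edges(max_hz):
    """Abutting band edges from 0 Hz up to the first edge at or above max_hz."""
    edges = [0.0]
    while edges[-1] < max_hz:
        lower = edges[-1]
        edges.append(lower + 100.0 if lower < 500.0 else lower * RATIO)
    return edges

for f in (100, 500, 1000, 4000, 15500):
    print(f, "Hz ->", round(critical_band_rate(f), 2), "bark")
print([round(e) for e in band_edges(2000.0)])
```

The mapping is linear in frequency below 500 Hz and logarithmic above it, which reproduces the qualitative shape of the critical band rate scale described in the text.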

When frequency is transformed into critical band rate, the masking
patterns outlined in Figure 3.10 change to those in Figure 3.11, where the level
of the barely audible pure tone is plotted as a function of the critical band rate
for the same narrow-band maskers shown in Figure 3.10. The effectiveness of
the critical band rate scale is clearly revealed. The shapes of the curves for
different center frequencies appear very similar.

Figure 3.11  Excitation level versus critical band rate for narrow-band
noises of given center frequency and 60 dB sound pressure
level. In contrast to Figure 3.10, the sound pressure level is
shown as a function of critical band rate.

The critical band rate scale not only describes the masking effect more
simply but also makes many other effects easier to understand, such as
pitch, barely noticeable frequency differences, or the growth of loudness as a
function of bandwidth [Bur92]. This is mainly because the new scale
conforms more closely to human physiology. Therefore, when dealing with
hearing sensations, it is very effective to first transform the frequency scale into
the critical band rate scale.

3.4.2  Transformation  of  Level  to  Specific  Loudness 

When measuring loudness on a quantitative scale, we usually
consider the loudness function of a 1 kHz tone. This function is established by
determining how much louder a sound is heard relative to a standard sound.
The standard sound in electroacoustics is a 1 kHz tone, and the reference level
in this case is 40 dB. Measurements from many different laboratories produced
similar results, so that eventually the loudness function of a 1 kHz tone in the
free field was standardized. It is given as the solid curve in Figure 3.12 in
terms of the sound pressure level (SPL) [Zwi90, Zwi91]. With the definition
that a 1 kHz tone of 40 dB SPL has the loudness of 1 sone, the curve indicates
that doubling the loudness from 1 to 2 sone is equivalent to increasing the
sound pressure level from 40 to 50 dB. The same holds for larger levels: a
doubling in loudness is achieved with each increment of 10 dB of the 1 kHz
tone. This means that 50 dB corresponds to 2 sone, while 100 dB corresponds to
64 sone.

Figure 3.12  The loudness function of a uniform exciting noise is shown by
the dotted line, and the loudness function of the 1 kHz tone is
shown by the solid line. The dot-dashed line and the dashed
line are straight-line approximations of the two loudness functions.
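The doubling rule for the 1 kHz tone can be written as a one-line formula, $N = 2^{(L - 40)/10}$ sone for a level $L$ in dB. The sketch below assumes this straight-line rule only, so it applies to levels of roughly 40 dB and above; the function name is hypothetical.

```python
def loudness_sone(level_db):
    """Loudness of a 1 kHz tone: 1 sone at 40 dB, doubling every 10 dB.
    Follows the straight-line rule only (roughly 40 dB and above)."""
    return 2.0 ** ((level_db - 40.0) / 10.0)

for level in (40, 50, 60, 100):
    print(level, "dB ->", loudness_sone(level), "sone")
```

Evaluating the function at 40, 50 and 100 dB reproduces the values of 1, 2 and 64 sone quoted in the text.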

The total loudness is comprised of many partial loudness fractions
which are located along the critical band rate scale. The physiological
equivalent of this assumption would be that all the neural activities of the
sensory cells along the basilar membrane are summed up to a value that
finally leads to the total loudness. If the summation or integral leads to a
total loudness given in units of sones, the integrand has to have the
dimension of sones per bark. This value is called specific loudness and is
denoted by $N'$. The total loudness $N$ is thus the integral of specific loudness
over the critical band rate, which can be expressed mathematically as follows:

$$N = \int_{0}^{24\,\text{Bark}} N'\,dz, \tag{3.48}$$

where $z$ denotes the critical band rate.

Although easily describable in purely physical terms, the 1 kHz tone produces
a complex pattern of excitation and a complicated specific loudness pattern
[Zwi90, Zwi91]. Therefore, we have to search for a sound that produces a more
homogeneous excitation versus critical band rate pattern. This sound is
the uniform exciting noise, which fills up the entire frequency range in such a
way that the same sound intensity falls into each of the 24 abutting critical
bands. The loudness of such a uniform exciting noise was measured. It was
found that the loudness of 1 sone is reached at a level of about 30 dB for
uniform exciting noise. The entire loudness function of uniform exciting noise
is shown by the dotted line in Figure 3.12 [Zwi90, Zwi91]. The curve rises
somewhat more rapidly with level than the loudness of the 1 kHz tone, at least
for levels of uniform exciting noise up to about 50 dB. Above 60 dB, the dotted line
can also be approximated by a straight line, which is shown dot-dashed in
Figure 3.12. It is interesting to see that the loudness of uniform exciting noise
is much larger than the loudness of the 1 kHz tone in almost the entire level
range indicated. For example, the loudness of a 60 dB uniform exciting noise is
about 3.5 times larger than the loudness of the 1 kHz tone with the same level.
It  indicates  very  clearly  that  an  overall  sound  pressure  level  of  broad-band 
noises  is  an  extremely  inadequate  value  if  loudness  is  to  be  approximated. 
Unfortunately  most  noise  which  produces  annoyance  to  people  is  broadband 
noise,  and  most  of  the  sound  pressure  level  is  a  measure  of  the  total  level, 
which  creates  misleading  values  when  used  as  an  indication  for  loudness. 
Therefore,  meters  based  merely  on  total  level,  such  as  UV  or  peak-level 
meters,  usually  give  readings  quite  unrelated  to  loudness,  although  these 
readings  should  correspond  to  loudness  sensation  as  closely  as  possible  from 
the  view  of  the  listener  as  the  final  receiver.  As  an  extreme  case,  researchers 
tried  to  achieve  64dB  sound-pressure  level  using  the  uniform  exciting  noise 
and  the  narrow  band  noise,  respectively,  and  also  computed  the  loudness  in 
the  unit  sone.  The  result  is  interesting  enough;  only  50dB  uniform  exciting 
noise  can  lead  to  64dB  sound-pressure  level  and  it  gives  the  total  of  20  sone 
compared  to  5  sone  of  the  narrow  band  noise. 
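Two pieces of arithmetic in this subsection are easy to check numerically:
the overall level obtained when 24 critical bands each carry the same level,
and the total loudness as a discretized version of the integral of specific
loudness over the critical band rate. The sketch below is illustrative only.
The function names, the 1 Bark step and the flat specific-loudness pattern
used in the example are assumptions, not measured data.

```python
import math


def overall_level_db(band_levels_db):
    """Overall level of incoherent bands: add intensities, then convert to dB."""
    total_intensity = sum(10.0 ** (level / 10.0) for level in band_levels_db)
    return 10.0 * math.log10(total_intensity)


def total_loudness(specific_loudness, dz=1.0):
    """Rectangle-rule approximation of N = integral of N' dz (N' in sone/Bark)."""
    return sum(n_prime * dz for n_prime in specific_loudness)


# 24 abutting critical bands, each carrying 50 dB.
print(round(overall_level_db([50.0] * 24), 1))   # close to the 64 dB quoted

# Hypothetical flat specific-loudness pattern of 0.5 sone/Bark over 24 Bark.
print(total_loudness([0.5] * 24))
```

The first result shows why 50 dB per band corresponds to roughly 64 dB
overall: 10 log10(24) adds about 13.8 dB.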
3.4.3 Postmasking Effects

There are many temporal effects in masking. However, only postmasking is
discussed here since it has the largest impact on efficient coding of
digital audio [Zwi90]. Postmasking results from the gradual release of the
effect of the masker: even after the masker is switched off, its effect
remains for some time. The duration of the postmasking depends on the
duration of the masker. Researchers have conducted an experiment in which a
2 kHz test tone burst of 5 ms was presented after a masker of a given
duration, the masker being a uniform masking noise at 60 dB. The result is
shown in Figure 3.13, where the sound pressure level of the test tone burst
is on the y-axis and the delay time is on the x-axis [Zwi90]. The solid
curve corresponds to a masker duration of 200 ms and the dotted curve to
5 ms. The masking effect decreases as a function of the delay time. However,
the postmasking generated by a very short burst such as 5 ms behaves quite
differently. It decays much faster than that of a longer masker, which
implies that postmasking depends mainly on the duration of the masker and is
quite nonlinear.

Figure 3.13 Postmasking depends on masker duration: (a) Duration of
maskers 200 ms and 5 ms; level of masker 60 dB; duration
of 2 kHz test tone 5 ms. (b) Level of barely audible test tone
burst as a function of its delay time (time between end of
masker and end of test tone).

Simultaneous masking and postmasking can be used to approximate the time
functions of the specific loudness. The specific loudness for a burst tone
of 200 ms and that for a burst tone of 5 ms are plotted in Figure 3.14. The
burst tones are placed on the linear time scale in such a way that both
bursts end at the same instant. For the 200 ms burst tone, the subsequent
decay lasts considerably longer. The specific loudness of the 5 ms burst
tone rises quickly to match that of the 200 ms burst tone; however, its
decay is quite different and much faster. The two decays can be approximated
very roughly by single time constants of about 30 ms for a burst tone
duration of 200 ms and about 8 ms for a duration of 5 ms. As Figure 3.14
shows, in both cases the decay is much faster during the early stage and
becomes slower later [Zwi90].

Figure 3.14 Specific loudness produced by masker bursts of 200 ms
(dotted line) and 5 ms (solid line) as a function of time.
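The single-time-constant approximation mentioned above can be sketched as a
plain exponential decay of the specific loudness after the burst ends. This
is only the rough model the text refers to: as noted, the real decay is
faster early on and slower later. The function and constant names are
illustrative assumptions.

```python
import math

TAU_LONG_MS = 30.0   # rough time constant after a 200 ms burst (from the text)
TAU_SHORT_MS = 8.0   # rough time constant after a 5 ms burst (from the text)


def decayed_specific_loudness(initial, t_ms, tau_ms):
    """Specific loudness t_ms after the burst ends, single-exponential model."""
    return initial * math.exp(-t_ms / tau_ms)


for t in (0.0, 10.0, 30.0, 60.0):
    long_burst = decayed_specific_loudness(1.0, t, TAU_LONG_MS)
    short_burst = decayed_specific_loudness(1.0, t, TAU_SHORT_MS)
    print(f"{t:5.1f} ms  200 ms burst: {long_burst:.3f}  5 ms burst: {short_burst:.3f}")
```

Starting both curves at the same value mirrors the observation that the 5 ms
burst reaches the same specific loudness as the 200 ms burst before decaying
faster.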

Based on these masking effects, we can further eliminate redundancies which
have no effect on human perception. Most likely, the redundant components
lie below the masking threshold curves of human auditory perception.

3.4.4  Specific  Loudness  vs.  Critical  Band  Rate  vs.  Time 

From extensive experiments with different groups of people, researchers
have found that the functions of specific loudness versus critical band rate
versus time illustrate very well the information flow in the human auditory
system. As three-dimensional patterns, they contain all the information that
is subsequently processed and leads to the different hearing sensations
[Bur92, Zwi90, Zwi91].

Based on these findings, our efforts are focused on the development of an
efficient algorithm which can transform a frame of input audio into its
three-dimensional representation in the psychoacoustical domain. In
addition, the inherent properties of the wavelet transforms play a key role
in this work because human sensory systems detect the input energy on a
logarithmic scale.

3.5 Evolution in Speech Coding
The United States Department of Defense (DoD) has been a major driving
force behind the development and deployment of the Secure Telephone Unit
(STU) for the last several decades. When the Linear Predictive Coding (LPC)
algorithm was first introduced in [Ata71], speech coding was expected to be
its first and major application. Since it is based on a model of the human
vocal tract and vocal cords, the algorithm is especially useful for encoding
and decoding human voices. Over time the structure of the conventional LPC
vocoder (voice encoder/decoder) has been extensively scrutinized and has
evolved; the different variations are shown briefly in Figure 3.15.
Depending on the requirements and design, the encoder computes from a frame
of input speech different parameters such as the LPC coefficients, the pitch
information, the energy, and even the residue signal. The decoder forms a
digital filter with the LPC coefficients and excites it with an excitation
signal; the output of the digital filter is the synthesized speech.
Theoretically the decoder achieves perfect reconstruction when the residue
signal itself excites the digital filter [Rab78]. However, sending the
residue signal over the communication channel or storing it on storage media
is often costly in terms of bandwidth. Therefore, it should be approximated
to reduce the overall bit rate of the coding algorithm. There are many
different ways to accomplish the approximation, and the different algorithms
are usually named after the way they approximate the excitation signal.

Figure 3.15 A block diagram of various LPC-based vocoders.

The lowest bit rate systems require every possible approximation of the
residue signal. The encoder extracts the most influential information from
the residue, called "pitch," together with the overall energy, then sends
them with the LPC coefficients to the decoder. The pitch determines whether
the frame is Unvoiced or Voiced (U/V), and plays a very important role in
the overall performance of the coder. On the decoder side, the overall
energy parameter determines the magnitude of the excitation signal, and the
pitch determines the way the excitation signal is generated: if the pitch
indicates that the frame is Unvoiced, the excitation signal is replaced with
white noise; otherwise, it becomes a train of impulses spaced exactly by the
pitch period. This kind of speech coder usually achieves decent quality at
2400 bps, which has been the dominant requirement of the market, and the
DoD's STUs at 2400 bps are all based on this principle.
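The decoder described above can be sketched in a few lines. The sketch below
builds the excitation from the voiced/unvoiced decision, pitch period and
gain, then runs it through an all-pole filter formed from the LPC
coefficients. The function names, the sign convention
s[n] = e[n] + sum_k a[k] s[n-k], and the Gaussian noise source are
assumptions made for illustration; they are not taken from any particular
STU or standard.

```python
import random


def make_excitation(n_samples, voiced, pitch_period, gain, seed=0):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [gain if i % pitch_period == 0 else 0.0 for i in range(n_samples)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def synthesize(excitation, lpc):
    """All-pole synthesis filter: s[n] = e[n] + sum_k lpc[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        s = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                s += a * out[n - k]
        out.append(s)
    return out


# Hypothetical frame: voiced, pitch period of 4 samples, first-order filter.
frame = synthesize(make_excitation(8, True, 4, 1.0), [0.5])
print(frame)
```

Only the filter coefficients, the voicing decision, the pitch and the gain
need to be transmitted per frame, which is why this class of coder reaches
such low bit rates.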

One immediate variation of the approach is to quantize the residue signal
directly and transmit it to the decoder. This class of vocoders is known as
the Residue Excited Linear Predictive vocoders (RELP). The overall
performance is noticeably better than that of the conventional LPC vocoders
at 2400 bps. However, their bandwidth requirement easily rises to the range
of 8 kbps to 9.6 kbps or higher.

In the early 1980s another variation, called "Multipulse LPC," emerged in
the industry. This method tries to approximate the original speech waveform
by exciting a digital filter with a series of impulses. The digital filter
used for synthesis is formed with the conventional LPC coefficients
obtained from the encoder, and the optimal train of impulses is computed by
minimizing the difference between the original and the synthesized speech.
Therefore, the encoder must contain a copy of the decoder, and for that
reason the method is also called an "analysis-by-synthesis" method. The
actual output of the encoder includes the locations of the impulses and
their amplitudes in addition to the usual LPC coefficients. The European
digital cellular phone standard, Groupe Speciale Mobile (GSM), is a
variation of this algorithm [ETS89]. The bandwidth requirement of this
method ranges from 8 kbps to 13 kbps.

Fairly recently, another class of vocoders, the Codebook Excited Linear
Predictive vocoders (CELP), has been developed [Kem89]. This time the
excitation signal is approximated with a series of predetermined codewords.
The encoder first computes the usual LPC coefficients of the given input
speech, then tries to find the optimal combination of the predetermined
codewords which best approximates the excitation signal. It also contains a
copy of the decoder and performs many decoding passes to find the best
sequence. Once it finds the optimal solution, the indices of the combination
are sent to the other side for synthesis. Since the indices of the sequence
are transmitted rather than the actual signal, the bandwidth saving is
remarkable and is achieved without sacrificing quality. That is one of the
reasons why the DoD has based the next generation of STUs at 4800 bps on the
CELP algorithm. The usual bandwidth usage for this algorithm ranges from
4.8 kbps to 9.6 kbps or higher.

Many researchers have found that quality and bandwidth usage are related,
and that their relationship is not always linear. They have also found that
good speech coding algorithms do not necessarily work well on audio signals
even with extra bandwidth. This implies that a totally separate approach
should be taken to develop an audio coding algorithm. The evolution of
speech coding algorithms is surely continuing. The most recent successful
implementation, CELP, is based on the codebook approach since it provides
relatively high quality at a considerably low data rate. This apparent
success encourages us to examine the possibility of encoding audio signals
with a codebook approach, applied in a totally different way.
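The analysis-by-synthesis codebook search described for CELP can be
illustrated with a toy exhaustive search. Each candidate codeword is passed
through a copy of the decoder's synthesis filter, scaled by its best gain,
and compared with the target; only the winning index and gain would be
transmitted. This is a minimal sketch under simplifying assumptions (a
single codeword per frame, no perceptual weighting, no adaptive codebook),
and all names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter: s[n] = e[n] + sum_k lpc[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        s = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                s += a * out[n - k]
        out.append(s)
    return out


def search_codebook(target, codebook, lpc):
    """Return (index, gain, squared_error) of the best-matching codeword."""
    best = None
    for index, codeword in enumerate(codebook):
        y = synthesize(codeword, lpc)             # decoder copy inside the encoder
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy   # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
target = [0.0, 2.0, 1.0, 0.5]        # equals 2 * synthesize(codebook[1], [0.5])
print(search_codebook(target, codebook, [0.5]))
```

The cost of the search grows with the codebook size because every codeword
requires a synthesis pass, which is the price paid for the low bit rate.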

3.6 System Proposition
Zwicker and coworkers [Zwi90, Zwi91] found that the time-varying specific
loudness along the critical band rate can provide sufficient information to
reconstruct an audio signal that is perceptually transparent to the human
auditory system. In addition, the human auditory system is well known to
work on a logarithmic scale over most of the mid and high frequency ranges
when it discriminates frequency differences. It is also well known that the
wavelet transforms work by dilation and translation of the base wavelet, so
the resulting system basically operates on a logarithmic scale. In this
research an algorithmic approach is proposed to represent audio signals in
the three-dimensional notation of critical band rate, specific loudness and
time.

When a frame of audio input is processed, a set of wavelet coefficients is
obtained. The next step is to relate them to the specific loudness along the
critical band rate. The analysis tree structure of the wavelet transform,
which may look similar to that in Figure 3.9, is carefully chosen to meet
the characteristics of human physiology, so that the boundaries of the
subbands closely match the physical locations of the human auditory sensors.
This, in turn, means that each subband in the tree structure is directly
associated with the corresponding critical band rate. The exact tree
structure used in this research is developed in the following sections.

When the critical band rates are determined, the specific loudness can be
approximated relatively easily by computing the energy for each subband.
There are several ways to measure energy; in this research the mean square
of the wavelet coefficients of each subband is used to determine the amount
of energy for the subband.
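The energy measure chosen above is straightforward to compute. The sketch
below takes the wavelet coefficients already grouped by subband and returns
the mean square of each group. The function name and the example
coefficients are illustrative assumptions.

```python
def subband_energies(subbands):
    """Mean square of the wavelet coefficients in each subband."""
    energies = []
    for coefficients in subbands:
        if not coefficients:
            energies.append(0.0)          # an empty subband carries no energy
            continue
        energies.append(sum(c * c for c in coefficients) / len(coefficients))
    return energies


# Two hypothetical subbands of different lengths.
print(subband_energies([[1.0, -1.0, 1.0, -1.0], [2.0, 0.0]]))
```

Using the mean rather than the sum keeps subbands with different numbers of
coefficients comparable, which matters because the analysis tree produces
subbands of unequal length.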

Once a given input audio frame is segmented into the corresponding subbands
and their energy distribution is computed, a set of coefficients must be
found which produces the most similar energy distribution when linearly
combined with the codewords from the predetermined codebook. Since each
codeword is highly tuned to represent the corresponding subband, the
resulting scheme represents a given set of wavelet coefficients with the
codewords that minimize the overall difference in the energy distribution
over the subbands.

The preliminary study indicates that the overall system has a solid
theoretical background. The development of a broadband audio coder usually
demands tremendous resources over a considerable period. Given the
motivations described, however, we set out to develop one with the limited
resources available.


CHAPTER  IV 
APPROACHES  AND  SIMULATIONS 


4.1  Selection  of  Base  Wavelet 

4.1.1  Orthogonality 

A wavelet function $\psi \in L^2(\mathbb{R})$ is called an orthonormal
wavelet if the family $\{\psi_{j,k}\}$ is an orthonormal basis of
$L^2(\mathbb{R})$; i.e.,

$$ \langle \psi_{j,k}, \psi_{l,m} \rangle = \delta_{j,l}\,\delta_{k,m}, \qquad j, k, l, m \in \mathbb{Z}, \tag{4.1} $$

and every $f \in L^2(\mathbb{R})$ can be written as

$$ f(x) = \sum_{j,k=-\infty}^{\infty} c_{j,k}\,\psi_{j,k}(x), \tag{4.2} $$

where the convergence of the series in (4.2) is in $L^2(\mathbb{R})$. The
series representation of $f$ in (4.2) is called a wavelet series. Analogous
to the notion of Fourier coefficients, the wavelet coefficients $c_{j,k}$
are given by

$$ c_{j,k} = \langle f, \psi_{j,k} \rangle. \tag{4.3} $$

Unlike orthogonal wavelets, there exist non-orthogonal wavelets which do
not satisfy (4.1). In order to describe these wavelets we borrow the
definition of "frames" from the mathematics literature. A family of
functions $\{\psi_j\}_{j \in J}$ in a Hilbert space $H$ is called a frame
if there exist $0 < A, B < \infty$ so that, for all $f$ in $H$,

$$ A\,\|f\|^2 \le \sum_{j \in J} |\langle f, \psi_j \rangle|^2 \le B\,\|f\|^2. \tag{4.4} $$

$A$ and $B$ are called the frame bounds. When $A = B$, the frame is called a
tight frame. When the frame is tight with $A = 1$ and the functions
$\psi_j$ have unit norm, they constitute an orthonormal basis. The values of
the frame bounds are sometimes used to denote the degree of redundancy in
the representations which use the functions $\psi_j$ as their basis
[Dau92].

Non-orthogonal wavelets allow a certain degree of redundancy in their
representation, and are used for applications which require extra
redundancy, such as edge detection in image processing [Mal89b]. In
contrast, orthogonal wavelets do not allow any redundancy in their
representation, which means that they usually perform better in terms of
compression. Because of these distinct and complementary properties, a great
deal of research has been devoted to both orthogonal and non-orthogonal
wavelets. In this research we restrict our interest to orthogonal wavelets
and their properties since our main objective is to represent a source
signal with minimal redundancy.


4.1.2  Linearity 

It is well established theoretically that the wavelet transforms and the
multiresolution representation are linear systems [Chu92, Dau88, Mal89a,
Mal89b, Rio91]. In order to verify the linearity of these systems with
actual data, a series of simulations with simple data was conducted: input
frames with various impulses were transformed to the wavelet domain through
the multiresolution representation. Then, the coefficients in the wavelet
domain were added together with different weightings, and the inverse
wavelet transform was applied to the sum. The synthesized data stream
consisted of a series of impulses, the locations and magnitudes of which
matched exactly those of the correspondingly weighted sum of the original
input frames in the time domain.
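The linearity experiment above can be reproduced in miniature. The sketch
below uses a single-level Haar transform as a stand-in for the wavelets
actually tested in this work (an assumption made to keep the example short
and exact), combines the coefficients of two impulse frames with different
weights, and checks that the inverse transform returns the same weighted sum
of the time-domain frames. The averaging normalization and all names are
illustrative.

```python
def haar_forward(x):
    """One analysis level: pairwise averages (approximation) and differences (detail)."""
    half = len(x) // 2
    approx = [(x[2 * k] + x[2 * k + 1]) / 2.0 for k in range(half)]
    detail = [(x[2 * k] - x[2 * k + 1]) / 2.0 for k in range(half)]
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([a + d, a - d])
    return x


x1 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # impulse at n = 2
x2 = [0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0]   # impulse at n = 5
w1, w2 = 2.0, -0.5

a1, d1 = haar_forward(x1)
a2, d2 = haar_forward(x2)
approx = [w1 * p + w2 * q for p, q in zip(a1, a2)]
detail = [w1 * p + w2 * q for p, q in zip(d1, d2)]

rebuilt = haar_inverse(approx, detail)
expected = [w1 * p + w2 * q for p, q in zip(x1, x2)]
print(rebuilt == expected)
```

The same test applies unchanged to any candidate base wavelet once its
forward and inverse transforms are available.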

Like many algorithms, the one developed in this research assumes that the
whole system is mathematically linear. This assumption is especially
important since the proposed system finds the optimal coefficients from the
WT coefficients and generates the time domain signal based on those
coefficients. The candidates for the base wavelet are, therefore, all tested
for their linearity.

4.1.3 Properties of Meyer's Wavelet and Lemarie's Wavelet

A basis of wavelets is defined to be an orthonormal basis whose functions
are dyadic scalings and translates of just a finite number of them. The
best-known example is the basis of Haar functions [Haa10] defined as

$$ h(x) = \begin{cases} 1, & 0 \le x < 1/2, \\ -1, & 1/2 \le x < 1, \\ 0, & \text{otherwise,} \end{cases} \tag{4.5} $$

which has the following useful properties:

(a) The basis consists of all dyadic scalings and translates of a finite
collection of wavelets $\psi_i$ for $i = 1, \ldots, n$.

(b) $\psi_i$ is a piecewise polynomial supported on the cube as defined in
(4.5). Thus it provides sharp localization but poor regularity.

(c) For all indices $\alpha$ for which $|\alpha|$ is less than or equal to
the maximum degree of the polynomials used,

$$ \int_{-\infty}^{\infty} x^{\alpha}\,\psi_i(x)\,dx = 0, \tag{4.6} $$

which means that $\hat{\psi}_i(\omega) = 0$ to some finite order at
$\omega = 0$.

Even though these functions provided new ideas in different theories, their
lack of regularity has caused considerable technical obstacles and
computational inconvenience [Bat87].

Meyer has created a basis of wavelets which is defined to be an orthonormal
basis of $L^2(\mathbb{R}^d)$ with the following interesting properties
[Mey86]:

(a) The basis consists of all dyadic scalings and translates of a finite
collection of wavelets $\psi_i$ for $i = 1, \ldots, 2^d - 1$.

(b) $\psi_i$ is a Schwartz function.

(c) (4.6) holds for all indices $\alpha$, which means
$\hat{\psi}_i(\omega) = 0$ to infinite order at $\omega = 0$.

In addition, Lemarie computed a basis of wavelets whose properties
complement the preceding properties in the following way:

(a) The basis consists of all dyadic scalings and translates of a finite
collection of wavelets $\psi_i$ for $i = 1, \ldots, 2^d - 1$.

(b) $\psi_i$ is of class $C^N$, where $N$ can be made arbitrarily large, and
has exponential localization.

(c) (4.6) holds for $|\alpha| \le N + 1$.

The wavelets of Lemarie are better suited for applications where
exponential localization is a more useful property than smoothness. Since
this research requires a high degree of localization of signals, the Lemarie
wavelets are chosen for further simulations. In fact, extensive simulations
with these wavelets on actual data are performed in Section 4.1.5. A
procedure to compute the Lemarie wavelets can be found in Mallat [Mal89c]
and Battle [Bat87].

The 27-point Lemarie wavelet filter h(n) is computed. The shape of the
filter and its frequency characteristics are shown in Figure 4.1. As shown
in Chapter 3, h(n) is a low-pass filter. In Section 4.1.5, the same filter
is used to compare its performance with others.

Figure 4.1 The 27-point Lemarie wavelet h(n); (a) time-domain plot, (b)
the frequency response.

The Meyer and Lemarie wavelets share one important property. As can be seen
in Figure 4.1, the filter h(n) decays toward zero as |n| grows. It never
reaches zero exactly, however; it only becomes negligibly small. This is in
direct contrast with the wavelet defined in the following section.

4.1.4  Daubechies  Wavelet  with  Compact  Support 

As described in Section 3.2, h(n) should satisfy the properties of (3.39)
through (3.45) to become a wavelet sequence. If we can find such an h(n)
with a finite length, it becomes a compactly supported wavelet sequence.

Daubechies found a way to compute the compactly supported wavelet sequence.
In her pioneering work [Dau88, Dau92], she represented (3.43) as

$$ m_0(\xi) = \left[ \tfrac{1}{2}\left(1 + e^{-i\xi}\right) \right]^{N} L(\xi) \tag{4.7} $$

and computed the coefficients $l_n$ of
$L(\xi) = \sum_n l_n e^{-in\xi}$. The coefficients $l_n$ are computed and
shown in Table 4.1 for N = 2 and N = 10. Since $m_0(\omega)$ is represented
in the compactly supported case as

$$ m_0(\omega) = \sum_{n=0}^{2N-1} h(n)\,e^{-in\omega}, \tag{4.8} $$

we can compute the filter coefficients h(n) for N = 2 and N = 10. Those
filter coefficients are calculated accordingly and shown in Table 4.2.
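For N = 2 the step from the coefficients $l_n$ to the filter h(n) is small
enough to check by hand or with a few lines of code: multiplying
$L$ by $[(1 + z)/2]^N$, with $z = e^{-i\xi}$, is a repeated two-tap
averaging of the coefficient list. The sketch below does this with the N = 2
values listed in Tables 4.1 and 4.2 and also checks that the resulting
filter sums to 1 and is orthogonal to its own even shifts. The function
names are illustrative, and the factorization used is the one inferred from
the tabulated values.

```python
def filter_from_l(l_coeffs, order):
    """Multiply L(z) = sum l_n z^n by ((1 + z) / 2)**order and return h(n)."""
    poly = list(l_coeffs)
    for _ in range(order):
        nxt = [0.0] * (len(poly) + 1)
        for i, c in enumerate(poly):
            nxt[i] += 0.5 * c        # contribution of the constant term 1/2
            nxt[i + 1] += 0.5 * c    # contribution of the term z/2
        poly = nxt
    return poly


def even_shift_product(h, k):
    """sum_n h(n) h(n + 2k): equals 1/2 for k = 0 and 0 otherwise when sum h = 1."""
    return sum(h[n] * h[n + 2 * k] for n in range(len(h) - 2 * k))


l_n2 = [1.36602540378576, -0.36602540378541]                      # Table 4.1, N = 2
h_table = [0.34150635094611, 0.59150635094611,
           0.15849364905389, -0.09150635094611]                   # Table 4.2, N = 2

h_computed = filter_from_l(l_n2, 2)
print(h_computed)
print(sum(h_table), even_shift_product(h_table, 0), even_shift_product(h_table, 1))
```

The computed coefficients agree with the tabulated ones to about twelve
decimal places, which is the precision of the printed $l_n$ values.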

Table 4.1 The coefficients $l_n$ of $L(\xi) = \sum_n l_n e^{-in\xi}$ for
N = 2 and N = 10.

 n     N = 2                 N = 10
 0      1.36602540378576      19.31118468722356
 1     -0.36602540378541     -56.85728928183314
 2                            81.30401849423677
 3                           -73.30673703120610
 4                            45.50299136016744
 5                           -20.00489381966199
 6                             6.18674374402645
 7                            -1.29022240275302
 8                             0.16380863548060
 9                            -0.00960454486217

Table 4.2 The filter coefficients h(n) for the compactly supported
wavelets for N = 2 and N = 10.

 n     h(n), N = 2           h(n), N = 10
 0      0.34150635094611      0.01885857879612
 1      0.59150635094611      0.13306109139688
 2      0.15849364905389      0.37278753574313
 3     -0.09150635094611      0.48681405536670
 4                            0.19881887088450
 5                           -0.17666810089695
 6                           -0.13855493936042
 7                            0.09006372426667
 8                            0.06580149355052
 9                           -0.05048328559836
10                           -0.02082962404378
11                            0.02348490704871
12                            0.00255021848393
13                           -0.00758950116791
14                            0.00098666268249
15                            0.00140884329510
16                           -0.00048497391993
17                           -0.00008235450305
18                            0.00006617718343
19                           -0.00000937920781

We are now ready to compute the wavelet functions and the scaling functions
for the compactly supported cases N = 2 and N = 10. To do so, the
relationship between h(n), $\phi(x)$ and $\psi(x)$ must first be
established. From Mallat's multiresolution representation, the
reconstruction of the original signal is achieved by the formula


$$ c^0 = \sum_{j=1}^{L} (H^*)^{j-1} (G^*)\, d^j + (H^*)^L c^L, \tag{4.9} $$

where $c^l$ and $d^l$ represent the approximation and the detail signal,
respectively, at the level $l$. $(H^*)^l$ can be represented by a histogram
$\eta_l$ with step width $2^{-l}$ and with amplitudes given by the condition
that the area under the histogram remains 1 for every $l$. The step function
$\eta_l$ can be written as

$$ \eta_l(x) = \left( T^l \chi_{[-1/2,\,1/2[} \right)(x), \tag{4.10} $$

where

$$ (Tf)(x) = \sum_n h(n)\, f(2x - n). \tag{4.11} $$

The Fourier transforms of (4.10) and (4.11) give

$$ \hat{\eta}_l(\omega) = (2\pi)^{-1/2} \left[ \prod_{j=1}^{l} m_0(2^{-j}\omega) \right] \frac{\sin(2^{-l-1}\omega)}{2^{-l-1}\omega}. \tag{4.12} $$

Hence, for $l \to \infty$, (4.12) becomes

$$ \lim_{l \to \infty} \hat{\eta}_l(\omega) = (2\pi)^{-1/2} \prod_{j=1}^{\infty} m_0(2^{-j}\omega). \tag{4.13} $$

By comparing (4.13) with (3.46), we come to the conclusion that

$$ \phi(x) = \lim_{l \to \infty} \eta_l(x). \tag{4.14} $$

From (4.10) and (4.11), a recursive definition of $\eta_l$ can be derived:

$$ \eta_l(x) = \sum_n h(n)\, \eta_{l-1}(2x - n). \tag{4.15} $$

Since this derivation is based on the histogram method, we can compute the
scaling function by the graphical method. The recursion (4.15) simplifies
the actual implementation procedure. Even though (4.14) requires
$l \to \infty$, the scaling function after 10 iterations shows no
visible difference from that after 9 iterations.
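The histogram recursion can be sketched as follows. Each pass refines the
step function by one level, halving the step width. Two points are
assumptions of this sketch rather than statements from the text: the
starting box is placed on [0, 1) instead of [-1/2, 1/2[ (which only shifts
the result), and a factor of 2 is applied in each pass so that, with a
filter normalized to sum to 1 as in Table 4.2, the area under the histogram
stays equal to 1. The function name `cascade` is illustrative.

```python
def cascade(h, levels):
    """Histogram amplitudes of the step function after `levels` refinements.

    The returned list holds the step heights; the step width is 2**-levels.
    """
    amplitudes = [1.0]                       # unit box: width 1, area 1
    for level in range(levels):
        stride = 2 ** level                  # shift by n in x equals n * 2**level new steps
        refined = [0.0] * (len(amplitudes) + (len(h) - 1) * stride)
        for n, hn in enumerate(h):
            for m, am in enumerate(amplitudes):
                refined[m + n * stride] += 2.0 * hn * am
        amplitudes = refined
    return amplitudes


h_n2 = [0.34150635094611, 0.59150635094611,
        0.15849364905389, -0.09150635094611]     # Table 4.2, N = 2

levels = 8
eta = cascade(h_n2, levels)
width = 2.0 ** -levels
print(len(eta) * width)        # support length, approaching 3 for N = 2
print(sum(eta) * width)        # area under the histogram
```

Plotting `eta` against x = k * width gives the familiar jagged N = 2 scaling
function; increasing `levels` only refines the picture.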

Figure 4.2 shows the scaling function for N = 2 computed from (4.10),
(4.14) and (4.15). Figure 4.3 contains the wavelet function for N = 2
obtained from Figure 4.2, (3.39) and (3.47). Both plots are taken at a
resolution of 0.01, even though the resolution can be made arbitrarily
small. Even at this resolution the most notable feature in the figures is
the lack of regularity. As N gets larger, the problem of regularity
diminishes. Figure 4.4 and Figure 4.5 show the scaling function and the
wavelet function, respectively, for N = 10. These figures are obtained by
the same method. Unlike the previous case, the lack of regularity is much
less pronounced. However, the support range has increased, which means that
the localization in time may not be as sharp as in the previous case. Thus,
applications should determine the order N depending on their requirements.
For the time being, N = 10 is chosen since it provides more accurate
compression and reconstruction than N = 2.

Figure 4.4 The compactly supported scaling function $\phi(x)$, N = 10.

Figure 4.5 The compactly supported wavelet function $\psi(x)$, N = 10.

Another notable feature of Figure 4.2 and Figure 4.3 is the lack of any
symmetry or antisymmetry for $\phi$ and $\psi$. This is quite unusual
compared with Meyer's and Lemarie's wavelets, and in fact the Haar basis is
the only compactly supported wavelet basis which has symmetry or
antisymmetry around any axis [Chi92, Dau88, Dau92].

The recursive definition of $\eta_l$ in (4.15) implies that all the
$\eta_l$ have compact support $[N_{l,-}, N_{l,+}]$, with
$N_{l,-} = \tfrac{1}{2}(N_{l-1,-} + N_-)$ and
$N_{l,+} = \tfrac{1}{2}(N_{l-1,+} + N_+)$, while $N_{0,-} = -\tfrac{1}{2}$
and $N_{0,+} = \tfrac{1}{2}$. Here $N_-$ and $N_+$ denote the smallest and
largest indices n for which h(n) is non-zero. The fact that
$N_{l,-} \to N_-$ and $N_{l,+} \to N_+$ as $l \to \infty$ implies that the
scaling function $\phi(x)$ has compact support $[N_-, N_+]$. Since only
finitely many points of g(n) are non-zero, the wavelet function $\psi(x)$
also has compact support.

4.1.5  Performance  Comparison:  Lemarie  Wavelet  vs.  Daubechies  Wavelet 

In  order  to  compare  the  performance  of  different  wavelet  transforms, 
extensive  simulations  with  both  the  Lemarie  wavelet  and  the  Daubechies 
wavelet  were  carried  out.  In  each  simulation,  a  data  sequence  is  transformed 
into  the  multiresolution  representation  and  transformed  back  to  the  original 
domain.  Then,  the  two  signals  are  compared  with  the  original  for  error. 


1 


1 


(4.16) 
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A  data  sequence  of  512  samples  has  been  carefully  chosen  from  the 
Beethoven  Symphony  No.  5,  Part  I  [Bee89]  as  shown  in  Figure  4.6,  on  which 
both  transforms  are  executed  up  to  5th  level  in  concatenation.  As  obtained  in 
the  previous  sections,  the  27-point  Lemarie  filter  sequence  and  the  20-point 
compactly  supported  wavelet  (CSW)  filter  sequence  are  adopted.  The 
intermediate  approximations  are  shown  in  Figure  4.7  and  Figure  4.8.  Figure 
4.9  and  Figure  4.10  show  the  results  of  the  synthesis  with  the  error  signal  for 
the  Lemarie  wavelet  and  the  CSW,  respectively. 

In both Figure 4.7 and Figure 4.8 the high frequency portions of the
signal disappear as the approximation proceeds, since they are separated into
the detail signals. Also, the number of samples is reduced by half at each
approximation, due to the orthogonality of the base wavelet.

The differences between Figure 4.9 and Figure 4.10, however, are visually
obvious, especially in the magnitudes of the error signals. The signal-to-noise
ratio (SNR) is not always an accurate measure of psychoacoustical
performance, yet it provides a relative comparison at minimal cost. The
computation shows an SNR of 42.35 dB for the Lemarie wavelet and an SNR of
215.94 dB for the CSW, which indicates that the CSW transform is virtually
lossless. Such a notable difference is mainly due to the compact support of the
basis functions, i.e. the CSW.
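
The SNR comparison above can be sketched as follows. This is a minimal illustration that assumes the usual definition, the ratio of signal energy to error energy expressed in decibels; the function name `snr_db` is hypothetical, and the exact formula used in the simulations may differ in detail.

```python
import numpy as np

def snr_db(original, reconstructed):
    """SNR in dB, assumed here to be 10*log10(signal energy / error energy)."""
    original = np.asarray(original, dtype=float)
    error = original - np.asarray(reconstructed, dtype=float)
    error_energy = np.sum(error ** 2)
    if error_energy == 0.0:
        return float("inf")  # bit-exact reconstruction
    return 10.0 * np.log10(np.sum(original ** 2) / error_energy)
```

For instance, `snr_db([3.0, 4.0], [3.3, 4.4])` evaluates to about 20 dB, since the error energy is one hundredth of the signal energy.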

We  can  now  conclude  that  the  CSW  provides  better  mathematical 
precision  than  the  Lemarie  wavelet.  For  our  purposes  high  mathematical 
precision  can  directly  translate  to  better  performance,  since  the  WT  are 
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Figure  4.7    The  approximations  of  the  original  signal  using  the 
Lemarie  wavelets. 


Figure 4.8 The approximations of the original signal using the compactly supported wavelets: (a) 2nd approximation, (b) 3rd approximation, (c) 4th approximation, (d) 5th approximation.
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(a)  synthesized  signal 


Figure 4.9 The synthesized signal and the error signal using the Lemarie wavelets.

Figure 4.10 The synthesized signal and the error signal using the compactly supported wavelets.
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usually  carried  out  several  times  in  concatenation.  Even  though  the  CSW  has 
disadvantages  such  as  irregularity  and  non-linear  phase,  it  serves  as  a  better 
base  function  for  compression  and  reconstruction.  Therefore,  further 
simulations  are  mainly  based  on  compactly  supported  wavelets. 

4.1.6  Daubechies  Wavelets  with  Vanishing  Moments 

Daubechies  explained  two  different  ways  to  synthesize  compactly 
supported  wavelets  (CSW)  [Dau88].  In  the  previous  simulation  the  base 
wavelet  was  obtained  by  recursive  computations  as  depicted  in  (4.14)  and 
(4.15).  On  the  other  hand,  the  base  wavelet  obtained  by  the  other  method 
exhibits  an  additional  property  known  as  the  vanishing  moments. 

The basic condition on $h(n)$ in (4.8) and its property as a QMF filter can
be rewritten as

$$|m_0(\omega)|^2 + |m_0(\omega + \pi)|^2 = 1. \tag{4.17}$$

According to Proposition 3.3 in [Dau88], $m_0(\omega)$ can be represented as

$$m_0(\omega) = \left[\tfrac{1}{2}\left(1 + e^{i\omega}\right)\right]^N Q(e^{i\omega}), \tag{4.18}$$

where $Q$ is a polynomial, and only a finite number of $h(n)$ are non-zero.
Because $m_0(\omega)$ has an $N$-th order zero at $\omega = \pi$, the $h(n)$ should satisfy in the
time domain

$$\sum_k (-1)^k k^m h(k) = 0, \qquad m = 0, \ldots, p-1, \tag{4.19}$$
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where  p  indicates  vanishing  moments  [Dau88,  Sin93].  In  addition,  all  the 
coefficients in $Q$ are real since all the $h(n)$ are real. From (4.18) we have

$$|m_0(\omega)|^2 = \left[\cos^2\frac{\omega}{2}\right]^N \left|Q(e^{i\omega})\right|^2. \tag{4.20}$$

The polynomial $|Q(e^{i\omega})|^2$ can be rewritten as a polynomial in $\cos\omega$ or as a
polynomial in $\sin^2\frac{\omega}{2}$. Assuming that $y = \cos^2\frac{\omega}{2}$, (4.17) becomes

$$y^N P(1-y) + (1-y)^N P(y) = 1. \tag{4.21}$$

The polynomial $P(y)$ of order $N-1$ is computed as

$$P_N(y) = \sum_{j=0}^{N-1} \binom{N-1+j}{j} y^j, \tag{4.22}$$

which solves (4.21). From Lemma 4.2 [Dau88], there exists a trigonometric
polynomial of the same order such that

$$\left|Q(e^{i\omega})\right|^2 = P\!\left(\sin^2\frac{\omega}{2}\right) = P\!\left(\tfrac{1}{2}(1 - \cos\omega)\right). \tag{4.23}$$
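
As a quick sanity check, the claim that the polynomial in (4.22) solves (4.21) can be verified with exact rational arithmetic. The sketch below is only an illustration; the helper names `daubechies_P`, `eval_poly`, and `bezout_lhs` are hypothetical and do not come from the original work.

```python
from fractions import Fraction
from math import comb

def daubechies_P(N):
    """Coefficients of P_N(y) in (4.22), lowest order first."""
    return [comb(N - 1 + j, j) for j in range(N)]

def eval_poly(coeffs, y):
    """Evaluate a polynomial (lowest order first) by Horner's rule."""
    result = Fraction(0)
    for c in reversed(coeffs):
        result = result * y + c
    return result

def bezout_lhs(N, y):
    """Left-hand side of (4.21): y^N P(1 - y) + (1 - y)^N P(y)."""
    P = daubechies_P(N)
    return y ** N * eval_poly(P, 1 - y) + (1 - y) ** N * eval_poly(P, y)
```

Evaluating `bezout_lhs(20, Fraction(1, 3))` returns exactly 1, and the same holds for any other rational `y`, because (4.21) is a polynomial identity.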

With the help of symbolic processing software [Mat91], the solutions of
(4.23) in terms of $e^{i\omega}$ are calculated with the maximum allowable resolution.
$N = 20$ is chosen since Sinha showed that the overall advantage saturates
after 20 or more [Sin93]. With the given conditions, 18 pairs of conjugate
solutions and two real solutions are obtained. In fact, the complex solutions
come in quadruplets, $z_0$, $\bar{z}_0$, $z_0^{-1}$, and $\bar{z}_0^{-1}$, and the real solutions are $r_0$ and
$r_0^{-1}$. Since (4.23) represents its magnitude, $Q(e^{i\omega})$ itself should be computed
accordingly. However, there is no mathematical limitation in selecting $Q(e^{i\omega})$
from its magnitude. The only constraint is to choose a conjugate pair of roots out of each
quadruplet and one root from the real root pair, so that the resulting $Q(e^{i\omega})$
has only real coefficients. With nine quadruplets and one real pair, there may be $2^{10} = 1024$
different $Q(e^{i\omega})$ functions, each of which gives a different set of the wavelet
coefficients from (4.18). In this study, a $Q(e^{i\omega})$ is selected without any
optimization process.
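
The root selection described above can be sketched numerically for small $N$. The code below is an illustration under stated assumptions, not the procedure used in this study: it always keeps the roots inside the unit circle (the minimum-phase choice, which is just one of the admissible selections), it normalizes $m_0(0) = 1$, and it relies on floating-point root finding, which may become unreliable for orders as large as $N = 20$. The function name `minimum_phase_daubechies` is hypothetical.

```python
import numpy as np
from math import comb

def minimum_phase_daubechies(N):
    """Coefficients of m0 as a polynomial in z = e^{iw}, highest power first."""
    # On the unit circle sin^2(w/2) = -(z - 1)^2 / (4 z); multiplying P by
    # z^(N-1) gives an ordinary polynomial of degree 2N - 2 in z.
    base = np.poly1d([-0.25, 0.5, -0.25])          # -(z - 1)^2 / 4
    poly = np.poly1d([0.0])
    for j in range(N):
        shift = np.poly1d([1.0] + [0.0] * (N - 1 - j))   # z^(N-1-j)
        poly = poly + comb(N - 1 + j, j) * base ** j * shift
    roots = poly.roots
    inside = roots[np.abs(roots) < 1.0]            # one root per reciprocal pair
    q = np.poly1d(inside, r=True)
    q = q / q(1.0)                                 # normalise so that Q(1) = 1
    half = np.poly1d([0.5, 0.5]) ** N              # ((1 + z) / 2)^N as in (4.18)
    return np.real((half * q).coeffs)
```

For `N = 2` the result, scaled by the square root of 2, is approximately `[0.4830, 0.8365, 0.2241, -0.1294]`, the familiar 4-point Daubechies filter. Choosing other roots from each quadruplet changes the phase of the filter but not its magnitude response.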

The  major  benefit  of  this  newly  added  property  becomes  visually 
obvious  when  the  frequency  characteristics  of  the  conventional  base  wavelet 
and  the  base  wavelet  with  maximum  vanishing  moments  are  compared  in 
Figure  4.11.  Since  the  research  is  based  on  the  clear  frequency  segmentation, 
the  sharper  cutoff  frequency  better  serves  our  objectives. 

After a series of simulations we finalized the base wavelet for the
research; it is a 40-point Daubechies wavelet with 20 vanishing moments,
whose shape is plotted in Figure 4.12. This base wavelet, of course, has some
disadvantages like other CSWs, such as irregularity and the lack of any
symmetry. However, it shows clear advantages, like compact support and
better frequency characteristics.

Figure 4.11 The frequency characteristics of a 20-point Daubechies wavelet with no vanishing moment (dotted line) and a 40-point Daubechies wavelet with 20 vanishing moments (solid line). The solid line decays much faster than the dotted line.
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Figure  4.12  The  40-point  Daubechies  wavelet  chosen  as  the  base  wavelet 
for  the  research.  It  has  20  vanishing  moments. 
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4.2  Justification  for  an  Efficient  Quantization  Algorithm 

4.2.1  Compression  Ratio 

The total number of samples of the original signal and that of the
signal after the transform are needed to determine the
compression ratio of the wavelet transform. The multiresolution
representation can be expressed as (4.9), where $c^0$ is the original signal of
length $N$. An operator $\Lambda(\cdot)$ is defined to give the number of samples at a given
level of the representation. Then, the number of samples after the transform
can be calculated as

$$\Lambda(c^0) = \sum_{j=1}^{l} \Lambda(d^j) + \Lambda(c^l), \tag{4.24}$$

which can be further reduced with the help of Figure 3.9. Since the signals are
downsampled by 2 at each level, (4.24) finally becomes

$$\Lambda(c^0) = \sum_{j=1}^{l} (1/2)^j N + (1/2)^l N = N. \tag{4.25}$$

Therefore, the multiresolution representation itself does not reduce the
number of samples at all even though the dynamic ranges of the coefficients
are observed to be reduced substantially. If the base wavelet functions are not
orthogonal it actually increases the total number of samples to $(l+1)N$,
which means the redundancy increases as the level of the transform goes up.
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This  is  one  of  the  major  reasons  why  the  orthonormal  base  wavelets  are 
adopted  throughout  the  work. 
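
The sample counts in (4.25) and in the non-orthogonal case can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical, and the frame length is assumed to be divisible by $2^l$.

```python
def decimated_sample_count(n_samples, levels):
    """Samples kept by the critically sampled transform, as in (4.25)."""
    details = sum(n_samples // 2 ** j for j in range(1, levels + 1))
    approximation = n_samples // 2 ** levels
    return details + approximation

def undecimated_sample_count(n_samples, levels):
    """Samples kept without downsampling: l detail signals plus one approximation."""
    return (levels + 1) * n_samples
```

With the 512-sample frame and 5 levels used earlier, the first function returns 512 while the second returns 3072, a sixfold redundancy.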

Even though the orthogonal base wavelets are used, the compression
ratio still stays at unity. The multiresolution representation consists of the
$l$ difference signals and the final approximation, and the dynamic ranges of the
difference signals are smaller than that of the original signal since they contain
only the difference between the approximations. Due to the reduced dynamic ranges of the WT
coefficients, they can now be represented with fewer bits. Alternatively, they
could be described in a somewhat different way based on other mathematical
or psychoacoustical principles. This quantization process is essential to
increase the compression ratio, thus making this coding algorithm more
meaningful. The rest of the research is mainly devoted to the development of
an efficient quantization algorithm which should be highly optimized to take
advantage of the specific properties of the wavelet transform and/or the
multiresolution representation.

4.2.2  Vector  Quantization 

The wavelet analysis with the repetitive tree structure defined in
Figure 3.9 transforms a frame of input signal to $2^n$ subbands of $2^{-n}N$ samples
each, and the typical shape of a frame of WT coefficients is shown in Figure
4.13 ($N = 512$, $n = 5$ in this case). The preliminary results show
that the overall shape of any given frame of the WT coefficients looks very
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Figure  4.13  A  typical  shape  of  a  frame  of  WT  coefficients. 
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similar  to  Figure  4.13,  which  has  large  fluctuations  at  first  and  relatively 
small  variations  at  the  end. 

Several different quantization methods with different parameter
settings have been applied to WT coefficients like those in Figure 4.13;
first, linear scalar quantizations were performed, and then non-linear scalar
quantizations. The resulting sounds were not satisfactory, mainly because the
compression ratio was set aggressively. For both methods, however, the
original music [Bee89] could be recognized, with some noise especially in the
high frequency range. An optimized non-linear quantization with a
comfortable compression ratio could generate reasonably good sound, but
little effort was spent on this work since a scalar quantizer was not our
intended final target.
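
To make the linear case concrete, the sketch below shows a plain uniform scalar quantizer. It is only an illustration under assumed settings (a mid-tread quantizer scaled to the peak coefficient of the frame, with a hypothetical `n_bits` parameter); the parameter settings actually used in the simulations are not reproduced here.

```python
import numpy as np

def uniform_quantize(coeffs, n_bits):
    """Mid-tread uniform quantizer scaled to the largest magnitude in the frame."""
    coeffs = np.asarray(coeffs, dtype=float)
    peak = np.max(np.abs(coeffs))
    if peak == 0.0:
        return coeffs.copy(), 0.0
    step = peak / (2 ** (n_bits - 1) - 1)   # step size for a signed n_bits code
    indices = np.round(coeffs / step)       # integer code per coefficient
    return indices * step, step
```

With 8 bits and coefficients such as `[3000.0, -1500.0, 0.4, -0.2]`, the step is about 23.6, so the two small coefficients are quantized to zero. This shows how a single step size discards fine detail when large and small coefficients share a frame.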

A  major  problem  in  designing  a  vector  quantizer  is  the  broad  dynamic 
range  of  the  WT  coefficients;  some  are  in  the  range  of  several  thousands  and 
some  are  less  than  one.  In  order  to  accommodate  their  wide  ranges,  a  pre/post 
processing  filter  was  considered.  Many  different  pre/post  processing  filters 
were  simulated,  for  example,  ignoring  the  coefficients  beyond  a  certain 
subband, dropping small coefficients in high frequency ranges, and a
non-linear preemphasis filter. In all cases, they provided better savings on
bandwidth  than  scalar  quantizers  at  similar  data  rates,  but  the  sound  quality 
still  remained  fuzzy  [Abu82,  Fos85,  Gra84]. 

From  all  the  simulation  efforts  for  different  quantization  methods,  we 
came  to  an  intermediate  conclusion  that  the  fine  resolutions  of  the  WT 
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coefficients  contain  highly  condensed  information,  and  that  losing  such  details 
causes  noisy  degradation  on  the  resulting  synthesized  sound  even  though  they 
are  small  in  magnitude.  Therefore,  a  different  approach  is  required  to 
maintain  the  quality  up  to  the  final  stage  of  synthesis. 

4.3  Frequency  Segmentation 

In order to represent audio signals with the proposed psychoacoustical
model, the specific loudness for each critical band should be obtained.
Since the critical band rate is closely related to frequency, and the critical
bands do not overlap their neighboring bands at all, a clear
frequency segmentation should be performed first.

With  the  Fourier  analysis  one  can  easily  determine  the  presence  of  a 
frequency  component  in  the  input  signal  because  the  base  function  is  a  single 
frequency  sinusoid  at  a  given  time.  However,  when  performing  the  wavelet 
transforms,  the  base  wavelet  is  not  a  single  sinusoid,  but  usually  a 
combination of different frequencies. Hence, determining the presence of a
single frequency component using the wavelet transform is not as easy
as it is with the Fourier transform. This suggests that a single
frequency  may  influence  several  different  subbands  in  the  wavelet  domain, 
especially  in  adjacent  subbands.  The  Fourier  transform  of  the  base  wavelet 
function  in  Figure  4.11  also  shows  non-ideal  frequency  characteristics.  This 
property  contributes  to  the  effect  called  "interband  aliasing",  which,  in 
general,  hinders  the  wavelet  transforms  from  segmenting  the  frequency 
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spectrum  clearly.  The  clear  frequency  segmentation  is  not  only  the  starting 
task  for  the  research,  but  also  serves  as  the  fundamental  ground  for  future 
developments.  For  this  reason  this  issue  requires  continuous  improvements  as 
new  ideas  evolve. 

In early simulations a symmetric tree structure was used to achieve a
frequency segmentation, where each subband has the same bandwidth as
shown in Figure 3.9. However, the new ideas based on the theories of
psychoacoustics clearly show that the widths of the subbands vary, and
in fact, they increase on a logarithmic scale as the center frequency of the
subband becomes higher. Therefore, a different tree structure is required to
match the critical band rate boundaries more closely with the known human
physiology.

Since high quality audio signals are usually sampled at 44.1 kHz or
higher, the frequency spectrum should cover at least up to 22.05 kHz. Based
on the principles of the multiresolution representation and psychoacoustics, a
new tree structure is developed and shown in Figure 4.14. The main difference
between Figure 3.9 and Figure 4.14 is the non-symmetry; the depth of a
certain subband in the tree structure of Figure 4.14 determines its
corresponding frequency band and its width, accordingly. In addition, Figure
4.15 indicates that the number of WT coefficients in a subband is determined
by $N/2^k$, where $k$ is the depth in the tree structure and $N$ is the number of
samples in a frame.
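
The relation between depth, coefficient count, and bandwidth can be sketched as below. The coefficient count $N/2^k$ follows the text; the bandwidth formula assumes ideal half-band splitting of the range up to half the sampling rate, which the real filters only approximate. The function name `subband_layout` is hypothetical.

```python
def subband_layout(frame_length, sample_rate_hz, depth):
    """Coefficient count N / 2^k and idealized bandwidth of a subband at depth k."""
    n_coeffs = frame_length // 2 ** depth
    bandwidth_hz = (sample_rate_hz / 2.0) / 2 ** depth
    return n_coeffs, bandwidth_hz
```

For a 2048-sample frame at 44.1 kHz, a subband at depth 8 holds 8 coefficients and spans roughly 86 Hz under this idealization.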
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Figure  4.14  The  non-symmetric  tree  structure.  This  provides  a  frequency 
segmentation  which  conforms  to  the  human  physiology  more 
closely.  The  numbers  indicate  the  boundaries  of  each  band 
and the dashed lines show the depth of the wavelet transform.

Figure 4.15 The structure of a frame of wavelet coefficients in the multiresolution representation. The number of samples in each subband is dependent on the depth in the tree structure.
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Due  to  the  discrete  nature  of  the  binary  tree,  the  boundaries  of  the 
subbands  may  not  match  perfectly  with  those  determined  mathematically  by 
the  logarithmic  scale.  The  number  of  subbands  and  their  bandwidths  are 
carefully  adjusted  to  match  the  theoretical  values  as  closely  as  possible.  The 
forthcoming  simulations  are  based  on  this  non-symmetric  tree  structure, 
which  serves  as  the  basis  for  the  psychoacoustical  model. 

4.4  Selection  of  Codebook 

Determining  the  codebook  is  of  great  importance  because  the  energy 
distribution  of  a  given  audio  frame  should  be  reproduced  by  the  linear 
combination  of  codebook  signals  along  the  critical  band  rate.  The  development 
of  a  codebook  usually  requires  a  tremendous  amount  of  computing  resources 
and man-power. In this research, three mathematically determined
codebooks are used to represent audio signals in the WT domain. The issue of
finding a better codebook deserves much more attention in future work.

For the first codebook, 10 frequency components from each
subband are chosen, equally spaced in the first 5 subbands, which cover up to
around 500 Hz, and logarithmically spaced in the remaining 23 higher subbands.
For each band, a 10-point Hanning window is applied to the chosen
components to prevent possible interband aliasing, especially around the
boundaries between subbands. Then, a representative sinusoidal signal is
composed from the 10 windowed frequency components. The resulting narrow
band signal for each subband has energy only in the corresponding subband.
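
The construction of one such codeword can be sketched as follows. Several details are assumptions made for illustration only: the components are placed from edge to edge of the band, the Hanning weights are taken from the interior of a 12-point window so that all ten components contribute, and the phases are set to zero. The function name `narrowband_codeword` and its parameters are hypothetical.

```python
import numpy as np

def narrowband_codeword(f_lo, f_hi, n_samples, fs=44100.0,
                        n_components=10, log_spacing=False):
    """Sum of Hanning-weighted sinusoids spread across one subband."""
    if log_spacing:
        freqs = np.geomspace(f_lo, f_hi, n_components)   # requires f_lo > 0
    else:
        freqs = np.linspace(f_lo, f_hi, n_components)
    # Interior of a longer window, so the outermost weights are non-zero.
    weights = np.hanning(n_components + 2)[1:-1]
    t = np.arange(n_samples) / fs
    tones = np.sin(2.0 * np.pi * freqs[:, None] * t[None, :])
    signal = (weights[:, None] * tones).sum(axis=0)
    return freqs, weights, signal
```

The tapering toward the band edges is what keeps the energy of the codeword away from the subband boundaries.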
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The  second  set  of  the  codebook  has  only  one  frequency  component  from 
each subband. The subbands in the linear frequency range have their
representative frequency at the center of the band, and those in the
logarithmic range have the frequency at their logarithmic center.
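
The two kinds of center can be written out explicitly. In this sketch the logarithmic center is taken to be the geometric mean of the band edges, which is the midpoint on a logarithmic frequency axis; the function name `representative_frequency` is hypothetical.

```python
import math

def representative_frequency(f_lo, f_hi, logarithmic):
    """Center of a band on a linear axis or on a logarithmic axis."""
    if logarithmic:
        return math.sqrt(f_lo * f_hi)   # geometric mean of the band edges
    return 0.5 * (f_lo + f_hi)          # arithmetic mean of the band edges
```

For a band from 1000 Hz to 4000 Hz, the logarithmic center is 2000 Hz, whereas the linear center would be 2500 Hz.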

The  third  codebook  set  is  a  filter  bank,  which  has  28  bandpass  filters  to 
cover  the  frequency  spectrum.  The  resulting  codewords  are  band  limited 
signals rather than a peak or several peaks. The performance of these
codebooks was compared in further simulations, and the results are
discussed in a later chapter.

All three codebooks have 28 narrow band signals, each of which
has energy exclusively in its respective subband. These narrow band
signals are then sampled at 44.1 kHz, and are transformed to the wavelet
domain with the non-symmetric analysis tree structure shown in Figure 4.14.
The general structure of the resulting coefficients looks like Figure 4.15.

Unlike other codebook approaches, these codebooks are totally
independent of the input signals since they are prepared mathematically.
Therefore, the current system does not require any form of training at all,
and, by the same token, does not need different codebooks for varying
environments. This also leaves ample room to improve the codebook selection
procedure. Again, finding the optimal codebook takes a tremendous amount of
resources and a different codebook may affect the overall performance
significantly. This issue is a serious subject for future work.

The chosen sinusoids of the first and second codebooks have arbitrary
phases since researchers have found that phase has very little effect on
human perception in general [Zwi90, Zwi91]. The amplitude and the phase
can be adjusted later psychoacoustically, and are subjects for future work.

4.5 Frequency Signature

The 28 narrow band signals of the codebooks have been analyzed
according to Figure 4.14, and the energy distributions of several codewords
are shown in Figure 4.16. Similar figures for all 28 codewords can be found
in the Appendix. They indicate that each narrow band signal does not exactly
match its corresponding subband, mainly because the frequency
characteristics of the base wavelet are not ideal. However, they show a clear
tendency that the major energy distributions move toward the high frequency
subbands as the frequency of the input narrow band signal increases.

In addition to this tendency, another valuable property is that each
codeword has a unique signature in its energy distribution. Their uniqueness
is verified as follows: the energy distributions of all 28 narrow band
signals are computed as row vectors, then collected into a 28-by-28
matrix. The singular value analysis of this matrix shows that it has full rank
and that the smallest singular value is relatively large compared to the largest
singular value. Therefore, the energy distributions of the 28 narrow band signals
constitute a linearly independent basis. In other words, a given energy
distribution of an input frame can be represented by a linear combination of
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the  energy  distributions  of  the  narrow  band  signals.  Upon  finding  the  optimal 
coefficients, the synthesis algorithm can combine them linearly with the
narrow band signals and produce the corresponding output signal.
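
The independence check described above can be sketched with a singular value decomposition. The function name `signature_independence` is hypothetical, and the energy matrix must be supplied by the caller, since the actual 28-by-28 matrix is not reproduced here.

```python
import numpy as np

def signature_independence(energy_matrix):
    """Return the rank and the ratio of smallest to largest singular value."""
    energy_matrix = np.asarray(energy_matrix, dtype=float)
    singular_values = np.linalg.svd(energy_matrix, compute_uv=False)
    rank = int(np.linalg.matrix_rank(energy_matrix))
    return rank, singular_values[-1] / singular_values[0]
```

A rank equal to the number of codewords together with a ratio that is not close to zero indicates that the rows are linearly independent and that the matrix is reasonably well conditioned.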

4.6 Optimization-Trial I

In order to find the optimal linear coefficients for a given frame, the
adaptive signal model shown in Figure 4.17 was adopted [Wid85]. Since the
linear coefficients constitute an unknown system in the scheme, the overall
issue becomes a problem of system identification; i.e., for a given energy
distribution a set of optimal linear coefficients should be computed according
to the mathematical basis of the pre-determined codebook. The energy
distribution of an input frame along the critical band rate becomes the desired
signal $d$ with which the approximated signal is compared. The coefficients of
the linear combiner are adaptively adjusted to minimize the error signal at
every iteration. The actual cost function becomes the square of the error
signal $e$.
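
An iteration of this kind can be sketched as a steepest-descent update on the squared error. This is an assumption standing in for the adaptive scheme of [Wid85], whose exact update rule and step size are not given here; the function name `fit_linear_combiner` and its parameters are hypothetical.

```python
import numpy as np

def fit_linear_combiner(codebook, desired, step=0.1, n_iter=2000):
    """Adjust the weights so that weights @ codebook approaches the desired vector."""
    codebook = np.asarray(codebook, dtype=float)   # one codeword per row
    desired = np.asarray(desired, dtype=float)
    weights = np.zeros(codebook.shape[0])
    for _ in range(n_iter):
        error = desired - weights @ codebook       # error signal e
        weights += step * error @ codebook.T       # descend the gradient of |e|^2
    return weights
```

The step size must be small relative to the largest singular value of the codebook matrix for the iteration to converge; the default is only suitable for small, well-scaled examples.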

Figure 4.17 The block diagram of the optimization routine.

After a series of simulations a serious problem was found in obtaining
the optimal solution from the scheme shown in Figure 4.17. In order to
explain the issue more easily, assume that the input is a sum of two
narrow band signals, $a$ and $b$. Then, by the property of linearity the WT
produces the corresponding wavelet coefficients, which are represented as the
sum of $\alpha$ and $\beta$, where $\alpha$ and $\beta$ are the sets of the WT coefficients of the signals
$a$ and $b$, respectively. The mean-square energy of the input signal for the $j$th
subband in the wavelet domain becomes

$$E_j = \frac{1}{N}\sum_{i=1}^{N}\left(\alpha_{ij} + \beta_{ij}\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\alpha_{ij}^2 + 2\alpha_{ij}\beta_{ij} + \beta_{ij}^2\right), \tag{4.26}$$

where the sum runs over the coefficients $i$ of subband $j$.

The optimization scheme is expected to find a set of linear coefficients,
$\omega_1, \omega_2, \ldots, \omega_{28}$, which should generate a similar energy distribution when
linearly combined with the respective codebook entries. The codebook is
composed of 28 1-by-28 row vectors, each of which is obtained by computing
the mean-square energy of the WT coefficients. Assuming that $\alpha, \beta, \ldots,$ and $\zeta$
are sets of wavelet coefficients (assume that $\zeta$ is the 28th set of wavelet coefficients),
the codebook can be represented as

$$\frac{1}{N}\begin{bmatrix}
\sum_i \alpha_{i,1}^2 & \sum_i \alpha_{i,2}^2 & \cdots & \sum_i \alpha_{i,28}^2 \\
\sum_i \beta_{i,1}^2 & \sum_i \beta_{i,2}^2 & \cdots & \sum_i \beta_{i,28}^2 \\
\vdots & \vdots & & \vdots \\
\sum_i \zeta_{i,1}^2 & \sum_i \zeta_{i,2}^2 & \cdots & \sum_i \zeta_{i,28}^2
\end{bmatrix}. \tag{4.27}$$

The matrix in (4.27) reveals that the scheme does not have a way to estimate
independently the cross-related term $2\alpha\beta$ which occurs in (4.26).
Mathematically, it happens because the mean-square operator is not linear.
In order to be able to estimate the cross-correlated terms, the codewords
should contain the terms $\alpha_{ij}$ and $\beta_{ij}$, not just $\alpha_{ij}^2$ and $\beta_{ij}^2$. This
new finding suggests that the codewords should be the wavelet coefficients
themselves rather than their energy distributions.
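
The non-linearity of the mean-square operator can be seen in a small numerical example. The coefficient values below are made up purely for illustration.

```python
import numpy as np

def mean_square(x):
    """Mean-square energy of a set of coefficients."""
    return float(np.mean(np.square(x)))

# Hypothetical WT coefficients of two narrow band signals in one subband.
alpha = np.array([1.0, 2.0, -1.0, 0.5])
beta = np.array([0.5, 1.0, 1.0, -2.0])

combined = mean_square(alpha + beta)                 # energy of the summed input
separate = mean_square(alpha) + mean_square(beta)    # what an energy codebook can model
cross = 2.0 * float(np.mean(alpha * beta))           # term the energy codebook cannot see
```

Here `combined` is 3.375 while `separate` is only 3.125; the difference of 0.25 is exactly the cross term, which no combination of the per-codeword energies can supply.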
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i  4.7  Progressive  Subframes 


4.7.1  Length  of  Input  Audio  Frame 

When the base wavelet is orthogonal, the length of a subframe is halved
at every level as the analysis goes deeper in the tree. According to the theories of
digital signal processing, the reduced number of samples in a frame translates
to a coarser resolution [Opp89]. This is true in our case, especially at the low
frequency subbands where the number of samples per subband becomes
particularly small, as shown in Figure 4.15. Our extensive simulations with
actual audio input reveal that at least 8 samples are required at the subbands
of the deepest level to keep the overall error within a reasonable range when
reconstructed, and that 16 samples are recommended to guarantee almost
perfect reconstruction of the input signals in fixed-point processing. This
means that the length of an input frame should be at least 2048
because the deepest level in the tree is 8 ($8 \times 2^8 = 2048$). However, 2048 or 4096 samples at
the sampling rate of 44.1 kHz cover about 46 ms or 93 ms of audio, respectively,
which is somewhat long for fast varying signals such as audio.
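
The frame length requirement and the resulting time window follow from simple arithmetic, sketched below with hypothetical function names.

```python
def min_frame_length(min_samples_deepest, max_depth):
    """Shortest frame that leaves the required samples at the deepest level."""
    return min_samples_deepest * 2 ** max_depth

def frame_duration_ms(frame_length, sample_rate_hz=44100.0):
    """Duration of a frame in milliseconds."""
    return 1000.0 * frame_length / sample_rate_hz
```

With a depth of 8, the minimum of 8 samples gives a 2048-sample frame (about 46.4 ms at 44.1 kHz), and the recommended 16 samples give a 4096-sample frame (about 92.9 ms).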

In order to keep the time window small while maintaining reasonable
resolution throughout the critical band rate, an idea called "Progressive
Subframing" is proposed. It is based on the fact that high frequency
components vary relatively faster than their low frequency counterparts. Burkhard
and Genuit explained a similar idea of performing logarithmic processing
with the Fourier transforms in [Bur92]. When a frame of $N$ points is taken as
an input, all of its $N$ points are used to extract the lowest frequency
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information  since  they  prefer  a  long  time  window.  When  the  frequency 
information  of  the  subbands  in  the  second  lowest  level  is  required,  the  frame 
is  divided  into  two  subframes  and  the  slightly  modified  analysis  is  performed 
on  each  of  them  independently.  This  time,  only  second  lowest  level 
information  is  computed  separately  for  both  subframes,  while  the  lowest  level 
information  is  taken  for  granted  from  the  previous  iteration.  Meanwhile  the 
slightly  modified  analysis  tree  maintains  the  minimum  number  of  samples  in 
the  smallest  subframes  within  the  recommended  limit.  The  same  principle 
can be applied repeatedly up to the highest frequency subbands. The size of a
subframe at each level becomes $N/2^{\mathrm{level}-1}$, and the number of analyses at
the level is determined as $2^{\mathrm{level}-1}$, where level is 1 for the lowest frequency
subbands. The details are shown in Figure 4.18.
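
The subframe sizes and analysis counts given above can be tabulated with a short sketch. The function name `progressive_subframes` is hypothetical, and the default of six levels is an assumption of this illustration.

```python
def progressive_subframes(frame_length, n_levels=6):
    """List (level, subframe size, number of analyses) for each level."""
    return [
        (level, frame_length // 2 ** (level - 1), 2 ** (level - 1))
        for level in range(1, n_levels + 1)
    ]
```

For a 4096-sample frame this yields one 4096-sample analysis at level 1 and thirty-two 128-sample analyses at level 6; at every level the subframes together cover the whole frame.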

Since  the  overall  frame  structure  has  changed,  the  analysis  tree 
structure  for  a  given  subframe  has  also  changed  and  is  quite  different  from 
Figure  4.14. 

4.7.2  Grouping  of  Subbands 

The 28 subbands are divided into 6 different frequency groups
according to the depth in the analysis tree structure in Figure 4.14. From the
lowest frequency subbands, Group I has subbands 1 through 4, Group II has
subbands 5 through 10, Group III has subbands 11 through 16, Group IV has
subbands 17, 18, and 19, Group V has subbands 20 through 25, and Group VI covers
subbands 26, 27, and 28.
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Figure  4.18  The  structure  of  the  Progressive  Subframes. 

A set of Group I coefficients (lowest frequency information)
can be obtained by processing one long frame. Two sets of
Group II coefficients are computed by applying the Group II
analysis on two subframes, respectively. The same principle
can be applied up to the Group VI subframes (highest
frequency information). Accordingly, each rectangle means a
subframe analysis and an optimization to find the
corresponding coefficients. Details are in Section 4.8.



In the previous simulations, only one analysis tree structure has been
used since the input frame has been processed as one block. In the new
scheme, however, the input frame is subdivided properly when particular
subbands are processed, and only a portion of the information is computed.
Likewise, the analysis tree structure also has to change properly to keep the
minimum number of samples within a reasonable limit. The Group I tree structure
is shown in Figure 4.19. Group I corresponds directly to subbands 1
through 4. Unfortunately, due to the interband aliasing effect, several other
group signals have influence on subbands 1 through 4. Therefore, Group I,
Group II, and Group III signals are needed when computing the appropriate
coefficients for the Group I subbands. The Group II tree structure is shown in Figure
4.20, and the 6 coefficients of interest, which correspond to subbands
5 through 10, are computed. The Group III tree structure is in Figure 4.21, and 6 coefficients
which correspond to subbands 11 through 16 are computed. The Group IV analysis
tree is shown in Figure 4.22, and 3 coefficients which link to subbands 17, 18,
and 19 are calculated. The Group V tree structure is in Figure 4.23, and 6
coefficients are computed here. The final Group VI tree structure is in Figure
4.24, and the last 3 coefficients are calculated. Naturally, one set of the Group I
coefficients is obtained for a frame, and 2, 4, 8, 16, and 32 sets of coefficients are
computed for Groups II, III, IV, V, and VI, respectively.
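
The bookkeeping in this paragraph can be summarized in a short sketch. It assumes one coefficient per subband per analysis, which the text states for Groups II through VI and which is inferred here for Group I; the names `GROUP_SUBBANDS` and `coefficients_per_frame` are hypothetical.

```python
# Subband numbers belonging to each group, lowest frequencies first.
GROUP_SUBBANDS = {
    "I": range(1, 5),
    "II": range(5, 11),
    "III": range(11, 17),
    "IV": range(17, 20),
    "V": range(20, 26),
    "VI": range(26, 29),
}

def coefficients_per_frame(groups=GROUP_SUBBANDS):
    """Total coefficients per frame when group g is analysed 2^(g-1) times."""
    total = 0
    for level, subbands in enumerate(groups.values()):
        total += len(subbands) * 2 ** level   # 1, 2, 4, 8, 16, 32 analyses
    return total
```

Under these assumptions a frame is described by 4 + 12 + 24 + 24 + 96 + 96 = 256 coefficients, most of them devoted to the fast-varying high frequency groups.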
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Figure  4.19 


The  tree  structure  of  Group  I  subbands. 

The  *  sign  indicates  the  subbands  of  the  main  interest. 
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Figure  4.20 


The  tree  structure  of  Group  II  subbands. 

The  *  sign  indicates  the  subbands  of  the  main  interest. 




Figure  4.21 


The  tree  structure  of  Group  III  subbands. 

The  *  sign  indicates  the  subbands  of  the  main  interest. 
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Figure  4.22 


The  tree  structure  of  Group  IV  subbands. 

The  *  sign  indicates  the  subbands  of  the  main  interest. 
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Figure  4.23  The  tree  structure  of  Group  V  subbands. 

The  *  sign  indicates  the  subbands  of  the  main  interest. 
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Figure  4.24  The  tree  structure  of  Group  VI  subbands. 

The  *  sign  indicates  the  subbands  of  the  main  interest. 



4.7.3  Energy  Spread  among  Narrow  Band  Signals 

Due to the non-ideal characteristics of the base wavelet, the energy
distributions of the narrow band signals are not limited to their
respective subbands. Instead, they usually leave meaningful energy residues
in the neighboring subbands. Therefore, the interband aliasing effect between
neighboring groups has a significant influence on the tree structure
analysis. The energy distribution of a given group leaves residues in the
subbands of other groups and is in turn affected by other group signals. For
example, Group I signals have most of their energy among subbands 1
through 6. Group II signals show their energy distributions among
subbands 5 through 14. Likewise, Group III signals have energy spread
between subbands 7 and 19. Table 4.3 shows the required neighboring
group signals when processing each group of interest. The figures in the
Appendix also support the table.

Therefore, the energy distributions of the neighboring groups are also
required when computing the coefficients of the group of interest. In other
words, if there are any group signals whose energy residues cover the
subbands of interest, they must be included in the computations to find the
optimal coefficients of the subbands of interest. Hence, when we compute
Group I coefficients, the narrow band signals of Groups I, II, and III should
be processed together, since Group III signals share some subbands with
Group II signals, which in turn have significant energy residues over some
Group I subbands. Then, the first 4 coefficients are accepted as the optimal
coefficients for Group I signals. The optimal coefficients for other group
signals are calculated according to Table 4.3.

Table 4.3  To analyze a certain group, the parameters from neighboring
groups are required. This is mainly due to the interband aliasing
effects.

Group of Interest    Groups to cover
I                    Group I, II, III
II                   Group I, II, III, IV
III                  Group I, II, III, IV, V, VI
IV                   Group II, III, IV, V, VI
V                    Group III, IV, V, VI
VI                   Group IV, V, VI

4.7.4  Selection  of  Codebook 

In the previous simulations in Sections 4.4 and 4.6, the codewords have
been 1-by-28 row vectors, so the codebook is a 28-by-28 matrix.
In the modified scheme, however, the codeword itself is a multidimensional
matrix, whose dimensions are determined by the analysis tree structures.
Figure 4.19 through Figure 4.24 show that the analysis tree structures
differ from group to group.

The same narrow band signals are used as in Section 4.4. They are
processed based on the principles of Figure 4.18, i.e., each narrow band signal
is transformed to the corresponding wavelet coefficients according to the
respective analysis tree structure in Figure 4.19 through Figure 4.24. The
resulting wavelet coefficients have exactly the same structure as explained in
Section 4.7.1.

4.8  Optimization-Progressive  Subframes 
The codebooks and input audio frames are now processed in the same
way, as shown in the analysis tree structures in Figure 4.19 through Figure
4.24. As shown in Figure 4.18, the number of coefficients to be computed is
determined by the group, and the number of parameters for the optimization
is dictated by Table 4.3. For example, if Group I subbands are to be
computed, Table 4.3 indicates that Group II and Group III have direct and/or
indirect influence on the first 4 subbands. Therefore, the first 16 codebook
entries, which belong to Group I through Group III (4 + 6 + 6), should be
used to find the optimal coefficients which affect the first 4 subbands. This
means that the optimization routine should find the optimal set of 16
coefficients, but only the first 4 coefficients indicate the involvement of
Group I signals in the Group I subbands. According to Figure 4.18, for the
Group II subbands the same procedure should be applied to the two subframes
independently, and only the two sets of 6 Group II coefficients are the final
outputs among the 19 coefficients per subframe (Groups I through IV). Since
the 4 Group I coefficients are already obtained from the previous iteration,
the optimization routine has to find just (19-4) = 15 coefficients. Likewise,
the same procedure can be applied until the 32 sets of Group VI coefficients
are obtained.
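
The parameter counts quoted above follow directly from Table 4.3 and the
group sizes. The following sketch reproduces them; the names are
hypothetical, and only the Group I and Group II cases are worked out
explicitly in the text.

```python
# Coefficients (subbands) per group, as listed in Section 4.7.2.
GROUP_SIZES = {"I": 4, "II": 6, "III": 6, "IV": 3, "V": 6, "VI": 3}

# Table 4.3: groups that must be covered when analyzing a group of interest.
GROUPS_TO_COVER = {
    "I": ["I", "II", "III"],
    "II": ["I", "II", "III", "IV"],
    "III": ["I", "II", "III", "IV", "V", "VI"],
    "IV": ["II", "III", "IV", "V", "VI"],
    "V": ["III", "IV", "V", "VI"],
    "VI": ["IV", "V", "VI"],
}


def parameter_count(group):
    """Number of codebook coefficients involved when analyzing `group`."""
    return sum(GROUP_SIZES[g] for g in GROUPS_TO_COVER[group])


if __name__ == "__main__":
    for group in GROUPS_TO_COVER:
        print(group, parameter_count(group))
    # Group II: the 4 Group I coefficients are already known,
    # so only 19 - 4 = 15 remain to be optimized.
    print(parameter_count("II") - GROUP_SIZES["I"])
```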

Due to the changes made since the initial optimization scheme in Figure 4.17,
a new scheme is proposed in Figure 4.25, which can handle the
cross-correlated terms between codebook entries. The linear coefficients
$\omega_1, \omega_2, \ldots, \omega_k$ are directly multiplied with the
corresponding codewords, which are now multidimensional matrices rather than
vectors, and summed to become the combined signal in the wavelet domain. The
mean-square energy of the combined signal is now computed along the subbands
to make a comparison with that of an input frame. The error signal is then
used to adjust the linear coefficients,
$\omega_1, \omega_2, \ldots, \omega_k$, driving toward minimizing the
square of the error signal.

Figure 4.25  The newly devised optimization scheme. This uses the
wavelet coefficients as its codebook entries. (Block labels recoverable
from the diagram: "WT Coefficients" and "Energy".)

The codewords, $\alpha, \beta, \ldots,$ and $\kappa$ (assume that $\kappa$
is the $k$th codeword), are determined by Figure 4.19 through Figure 4.24
according to the subframes of interest, and the combined signal should have
the same structure; its $(i,j)$th entry becomes

$$\omega_1 \alpha_{ij} + \omega_2 \beta_{ij} + \cdots + \omega_k \kappa_{ij}. \tag{4.28}$$

The energy distribution of the combined signal is calculated by applying the
mean-square operator on the columns of (4.28). The energy for the $j$th
subband then becomes

$$\frac{1}{I} \sum_{i=1}^{I} \left( \omega_1 \alpha_{ij} + \omega_2 \beta_{ij} + \cdots + \omega_k \kappa_{ij} \right)^2, \tag{4.29}$$

where $I$ is the number of rows of the combined signal. The cost function is
defined as the sum of the square of the difference between the combined
signal and the desired signal,

$$\text{Cost Function} = \sum_{j} \left[ \frac{1}{I} \sum_{i=1}^{I} \left( \omega_1 \alpha_{ij} + \omega_2 \beta_{ij} + \cdots + \omega_k \kappa_{ij} \right)^2 - R_j \right]^2 \tag{4.30}$$

where $R_j$ represents the $j$th subband's mean-square energy of a given
desired signal. The optimal solution is a set of
$\omega_1, \omega_2, \ldots, \omega_k$ which produces the minimum cost on
(4.30). Finding the global minimum of (4.30) is a challenging problem,
especially because it is a high-order equation with quite a few variables.
Our exhaustive efforts to find the analytical solution to the problem show
that the complexity and the amount of computation are prohibitively large
even with the help of the powerful software and hardware available. Hence,
numerical optimization methods are used to find the optimal solution within
the given limitations.

There are several well-known quasi-Newton methods which ease the
computing requirement of the Newton method considerably. In this research
the BFGS method (see details in Section 2.3) provided by the Matlab
Optimization Toolbox [Opt94] is adopted due to its robust performance and
reasonable speed.
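
As an illustration only, the cost function of (4.30) and a BFGS search can
be sketched in Python with NumPy and SciPy. This is not the Matlab code used
in this research: the toy codebook, its dimensions, and the function names
are assumptions made for the example. Because the cost is a quartic
function of the coefficients, a quasi-Newton search is only guaranteed to
reach a local minimum, and a coefficient set and its negative yield the same
energies.

```python
import numpy as np
from scipy.optimize import minimize


def subband_energy(weights, codewords):
    """Mean-square energy per subband of the combined signal.

    codewords has shape (k, I, J): k codewords, each an I-by-J matrix
    whose columns correspond to subbands. weights has shape (k,).
    """
    combined = np.tensordot(weights, codewords, axes=1)  # weighted sum, (I, J)
    return np.mean(combined ** 2, axis=0)                # mean over rows, (J,)


def cost(weights, codewords, target_energy):
    """Sum over subbands of the squared energy mismatch."""
    diff = subband_energy(weights, codewords) - target_energy
    return float(np.sum(diff ** 2))


# Hypothetical toy problem: 3 random codewords of size 8-by-4.
rng = np.random.default_rng(0)
codewords = rng.standard_normal((3, 8, 4))
true_weights = np.array([1.0, 0.5, -0.25])
target = subband_energy(true_weights, codewords)

start = np.ones(3)
result = minimize(cost, x0=start, args=(codewords, target), method="BFGS")

if __name__ == "__main__":
    print("initial cost:", cost(start, codewords, target))
    print("final cost:  ", result.fun)
```

In practice the starting point matters; warm-starting each subframe from the
previous solution is one possible choice, although the source does not state
how the search was initialized.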

4.9  Buffer Handling
Since the input audio is processed frame by frame, the statistical
characteristic of a frame is assumed stationary. Hence, when the output audio
is synthesized, there might be sudden changes between frames. In order to
reduce such effects, smoothing is required at the start and end of each
frame. In this work a 64-point Hanning window is used to smooth the
boundaries of output audio frames; the first 32 points of the window are
applied to the first 32 points of the following frame and the second 32 points
are multiplied with the last 32 points of the previous frame, and both are
added together. Since 32 points are overlapped on the output side for each
frame, the input frames should also overlap by 32 points; the last 32 points
of the current frame are used as the first 32 points of the following input
frame.
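
A minimal sketch of this boundary smoothing is given below, assuming NumPy's
symmetric `np.hanning` window and equal-length frames; the function name and
the test frames are hypothetical. With this window the two halves sum to
approximately, but not exactly, one across the overlap.

```python
import numpy as np

OVERLAP = 32  # samples shared by consecutive frames


def overlap_add(frames, overlap=OVERLAP):
    """Cross-fade consecutive output frames with a Hanning window."""
    window = np.hanning(2 * overlap)       # 64-point window
    fade_in = window[:overlap]             # rising half, for the frame start
    fade_out = window[overlap:]            # falling half, for the frame end
    hop = len(frames[0]) - overlap         # new samples contributed per frame
    out = np.zeros(hop * len(frames) + overlap)
    for k, frame in enumerate(frames):
        shaped = np.array(frame, dtype=float)
        if k > 0:                          # smooth the start of later frames
            shaped[:overlap] *= fade_in
        if k < len(frames) - 1:            # smooth the end of earlier frames
            shaped[-overlap:] *= fade_out
        out[k * hop:k * hop + len(shaped)] += shaped
    return out


if __name__ == "__main__":
    frames = [np.ones(128) for _ in range(3)]
    y = overlap_add(frames)
    print(len(y), y.min(), y.max())
```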


CHAPTER  V 
RESULTS  AND  CONCLUSION 

5.1  Procedure 

The  performance  evaluation  of  audio  coders  is  often  a  very  difficult 
task.  It  usually  requires  many  subjective  tests  with  different  groups  of  people. 
Before  the  subjective  tests,  a  series  of  monotones  with  various  frequencies  are 
quite  often  used  to  verify  the  overall  system  performance.  Table  5.1  has  a  list 
of  the  test  tones  with  various  center  frequencies  used  for  the  research.  The 
locations  of  the  center  frequencies  are  determined  arbitrarily  and  do  not 
match  the  center  frequencies  of  any  subbands  in  general.  In  order  to  examine 
the  interaction  of  the  monotones  in  neighboring  subbands  the  table  also  has 
some  test  tones  which  are  composed  of  two  monotones  with  wide  and  close 
spacings. 

The test tones shown in Table 5.1 are processed with each of the three
codebooks obtained in Section 4.7.4. Then, their performances are
compared graphically and subjectively. Since the model adopts some basic
ideas from psychoacoustics, the final comparison should be made by
subjective tests. However, the audio coder we have developed needs more
refinements and is not yet ready for full-scale subjective testing. Given
the simplicity of the test tones, the testing subjects are therefore limited
to a small number of trained personnel. As the model becomes more
sophisticated with further refinements, the number of testing
subjects should increase accordingly.

Table 5.1  The center frequency of the test tone(s)

Test tone    Center frequency of monotone(s)
1            400Hz
2            1KHz
3            4KHz
4            8.5KHz
5            18KHz
6            80Hz and 7KHz
7            200Hz and 600Hz
8            12KHz and 14KHz

The theories of psychoacoustics are mainly based on the hearing
behaviors of human ears in the frequency domain. Hence, the output signals
should be analyzed according to proven theories in the frequency domain. A
Fourier analysis of a full size frame provides the frequency information of the
whole frequency band regardless of the subframes, since the time window is
large and fixed. Such an analysis is useful for stationary signals, but it
is not as good a tool for non-stationary signals. Hence, full frame analyses
along with graphical comparisons are used to estimate the performance of the
overall system on the stationary signals shown in Table 5.1. The graphical
comparisons provide very informative visual measures which quite often match
the results of the subjective tests.
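
A full-frame Fourier analysis of a stationary test tone can be sketched as
follows. The 44100 Hz sampling rate is the one used in Section 5.3.4; the
4096-sample frame length is an assumption inferred from expression (5.1),
and the function name and the use of a Hanning analysis window are
illustrative choices rather than the procedure used for the figures.

```python
import numpy as np

SAMPLE_RATE = 44100  # Hz
FRAME_SIZE = 4096    # samples; assumed full frame length


def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Return the frequency (Hz) of the largest magnitude-spectrum peak."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]


# Test tone 2 of Table 5.1: a 1KHz monotone.
t = np.arange(FRAME_SIZE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 1000.0 * t)
peak = dominant_frequency(tone)

if __name__ == "__main__":
    print("dominant frequency (Hz):", peak)
```

The frequency resolution of such an analysis is one bin, i.e.
44100/4096 Hz, so the reported peak can deviate from the true tone
frequency by up to that amount.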

5.2  Results 

5.2.1  Input  Signals 

The first group of test tones used in the research are single frequency
monotones and/or their combinations and are, therefore, basically stationary
signals. In such cases, the size of the time window does not matter much since
there is supposedly no statistical change over time. In order to properly
monitor the performance on non-stationary signals, an effective tool based on
psychoacoustics should be devised.

Three different codebooks are used in the work, and typical frequency
characteristics of a subband are shown in Figure 5.1. Since the optimization
routine mathematically tries to match the energy distribution of the input
signal with a linear combination of the codewords, the frequency
characteristic of the codewords is of significant importance. The optimization
routine in Figure 4.25 computes and minimizes the cost function in (4.30), i.e.,
the mathematical difference between the desired signal and the combined
signal. Mathematically speaking, it optimizes the cost function in terms of the
peaks of the codewords as well as their valleys. However, human ears
perceive the peaks in the frequency spectrum, and most of the other
neighboring components are masked off by the nearby peaks. Apparently, the
current scheme pays equal attention to less important portions of the
spectrum which are inaudible to human ears. Therefore, it is very important
to design the spectral shape of the codebook based on the masking curves of
human hearing sensations as shown in Figure 3.11. Long term monitoring of
the output coefficients using the three codebooks shows that the coefficients
of the first two codebooks vary over time with relatively large variations,
whereas the coefficients of the third codebook vary with relatively small
changes. Based on these observations, the codebook of Figure 5.1(c) gives the
best results due to its resemblance to the natural characteristics of the
human hearing threshold curve. Hence, the performance comparisons in
Chapter IV and Chapter V are based on the codebook generated by the bandpass
filter banks, one of which is shown in Figure 5.1(c). This issue will be
further discussed in future work.

Figure 5.1  The frequency characteristics of the 15th subband of the three
codebooks used in the research: (a) a single frequency is chosen
out of the subband, (b) 10 frequency components are
selected, and (c) a bandpass filter is used for the subband.
(Each panel is plotted against frequency.)

5.2.2  Stationary  Signals 

The original sound of a 1KHz tone and the synthesized tone are shown
in Figure 5.2. The synthesized signal has some fluctuations in magnitude,
but it matches the original very closely, and the main frequency of 1KHz is
well preserved. Their frequency spectra plotted in Figure 5.3 indicate
similar results. The magnitude variations in the time domain are translated
into several minor peaks in the frequency domain. They originate from the
optimization routine, which tries to find a mathematical solution rather than
psychoacoustical equivalents. Due to those mismatches in the optimal
solutions, sudden changes may occur between subframes. The listening tests
show that the subjects can recognize the slight changes in magnitude between
subframes. Overall, the fundamental properties of the 1KHz tone are well
preserved after the processing.

A 400Hz tone is processed to examine the performance of the system on
low frequency inputs. The time domain signals of the original and the
synthesized tone are shown in Figure 5.4. As pointed out in the 1KHz tone
case, the synthesized output shows some variations in magnitude. Sudden
changes in the mathematical solutions from the optimization routine are the
main contributors to these variations.

Figure 5.2  The time domain signal of (a) the original 1KHz tone and (b)
the synthesized signal. The major frequency component of
1KHz is well preserved after the processing.

Figure 5.3  The frequency domain comparison between the original
1KHz tone shown as the smooth line, and the synthesized
tone shown as the irregular curve.

Figure 5.4  The time domain signals of (a) 400Hz tone and (b) the
corresponding synthesized output.

The same analysis is carried out on a 4KHz tone. The original and the
synthesized tone are compared in Figure 5.5 and Figure 5.6. The time domain
comparison shows some magnitude fluctuations like the previous cases, but
the dominating frequency component stays psychoacoustically the same as the
original input. The frequency domain analysis shows clearly that the general
frequency characteristics of both signals are very similar.

Figure 5.7 shows the frequency spectra of the original 8.5KHz input tone
and its synthesized output. It shows that the dominating
frequency component is preserved along with other minor variations. Figure
5.8 has the same information for the input tone of 18KHz. It also shows that
the dominating information stays very close to the original with some minor
fluctuations.

The simulations with single tone inputs indicate that the dominating
frequency component in each input is well preserved in the corresponding
synthesized output. Each output contains some fluctuations, but they are
relatively small compared with the dominating component. In
the current implementation, all the information from the analyzer is
transferred to the synthesizer. In the following chapter, we propose a
pre-filter which masks off minor inaudible components based on
psychoacoustics. Such a filter can reduce or eliminate unnecessary
fluctuations in various frequency subbands.

Figure 5.5  The time domain signals of (a) the original 4KHz tone and (b)
the reproduced output.

Figure 5.6  The frequency spectrum of the original 4KHz tone is the
smooth line and the synthesized signal is the irregular line.

Figure 5.7  The frequency spectrum of the original 8.5KHz tone is the
smooth line and the corresponding synthesized output is the
irregular line.

Figure 5.8  The frequency spectrum of the original 18KHz tone is the
smooth line and the corresponding synthesized output is the
irregular line.

The system performance on several multitones is quite important since
real world sounds usually contain several frequency components at the same
time. The 6th test tone has two major frequency peaks spaced far apart. The
frequency characteristics of the original and the synthesized tone in Figure
5.9 show that the high frequency peak is well represented by its
corresponding codebook signal, but the low frequency peak is somewhat
distributed to its neighboring subbands. That is partially because the input
frequency peak is located roughly in the middle of the two neighboring
subbands. The proposed pre-filter can provide a solution to this issue by
determining the respective subband for each considerable frequency peak
before the actual processing is applied. Due to the two frequency peaks in
the low frequency, the envelope of the time domain signal in Figure 5.10 is a
combination of the two frequencies rather than one.

The 7th test tone is designed to monitor the characteristics of two
monotones spaced closely in the low frequency range. The time domain results
in Figure 5.11 show an interesting phenomenon regarding the phases of the
tones in the input. The input signal looks somewhat different from the
respective output signal, but in fact the output signal preserves the two
dominating frequency components, though not their phases. The phase
differences of the two tones in the output produce an unfamiliar look, but a
careful examination indicates that the output signal itself has a symmetry.

The synthesized output of the 8th test tone, with two peaks closely
located in the high frequency range, shows a consistent result. In fact,
testing subjects have difficulties in clearly distinguishing the two
different tones since human ears are not sensitive to such high frequency
sounds. This is objectively supported by the relatively high threshold values
in the extremely high frequency portion of the audible spectrum, as shown in
Figure 3.10.

Figure 5.9  The frequency spectrum of the original multitone is the
smooth line and the corresponding synthesized output is the
irregular line. The low frequency peak is represented by the
two neighboring peaks.

Figure 5.11  The time domain signals of (a) the input multitone with the
peaks at 200Hz and 600Hz and (b) the corresponding synthesized
output. The output looks different, but the two dominating
frequencies of the signals are the same and the phases of the
two components are different.

5.2.3  Non-Stationary Signals

The performance of the system on non-stationary signals has been
tested with a sample of broadband music [Elg34]. The sound quality of the
resulting output is not competitive. That is mainly because the time window of
the current system is relatively large, so that it cannot reproduce rapidly
changing signals. Reducing the time window alone may raise another
significant issue, namely the numerical resolution within the multiresolution
representation. This topic requires further study in the future.

5.3  Contributions 
Many researchers have devoted a tremendous amount of effort to
accomplishing a faithful encoding and decoding of audio signals with a minimum
bandwidth. Various algorithms have been developed, such as the Fourier
methods, the linear predictive methods, etc., and some have proven better
than others in different respects. The more recent methods such as MPEG
audio and others [Bra94, Sin93] have adopted the subband coding method.
These newer methods basically perform a subband analysis and then apply
respective encoding algorithms to the separate subband signals. Unlike the
known methods, however, our research employs a codebook approach to
replace the subband signals with carefully designed codewords rather than
quantize them directly. Furthermore, psychoacoustical theories heavily
influenced the design and implementation of most system components,
whereby the performance was "fine tuned" for the final receiver, the human
ear.

5.3.1  Subband  Analysis 

During the early stages of the research, we mathematically
computed different base wavelets such as Lemarie, Daubechies, and
Daubechies wavelet with vanishing moments. After carefully exploring the
mathematical properties of these wavelets, the Daubechies wavelet with
vanishing moments was selected. The advantage of the Daubechies wavelet with
vanishing moments over the other base wavelets is its sharp cutoff frequency,
which is essential for clear frequency segmentation.

Based on the base wavelet, the plain analysis tree structure of the
wavelet packetization [Wic89] was psychoacoustically modified to
approximate the human hearing behaviors more closely. The resulting
analysis tree structure is non-symmetric, as shown in Figure 4.14, due to the
logarithmic nature of the hearing behaviors.

5.3.2  Codebook  Based  Approximation  of  Subband  Signals 

The robust performance of the CELP speech coding algorithm successfully
demonstrated that the codebook based approach could provide a reasonable
approximation of excitation signals in speech coding algorithms. This recent
success encouraged us to borrow the concept of the "codebook approach" to
represent subband signals in this research.

The current implementation demonstrates the feasibility of the codebook
approach for audio coding. For limited cases of stationary inputs, it clearly
demonstrates that subband signals can be described effectively in terms of the
predetermined codewords. The subjective tests and graphical evaluations
show that the proposed audio coder is capable of processing various
monotones and multi-tones. The frequency spectra of the output signals
match the original very closely, such that the original and the output
signals can hardly be distinguished by human ears. Furthermore, the approach
has many other advantages in different respects, which are described in the
following sections.

After  a  comprehensive  search  of  audio  coding  algorithms  in  technical 
reports  from  related  journals,  we  have  concluded  that  the  codebook  based 
audio  coding  has  never  been  reported.  We  are  very  confident  that  codebook 
based  audio  coding  can  provide  a  viable  solution  to  achieve  a  reasonable  audio 
quality  at  substantially  less  bandwidth  than  the  conventional  audio  coding 
algorithms  [Bra94,  Deh91,  Sin93,  Tew93]. 

5.3.3  Progressive  Subframing 

In order to reduce the time window of the system, we have proposed a
novel approach called "Progressive Subframing." It effectively reduces the
time window size, especially in high frequency ranges, which results in finer
resolutions for faithful representation of fast varying audio signals.


5.3.4  Bandwidth  Requirement 

As a measure of performance, the bandwidth requirement of the
codebook based audio coding algorithm was compared against that of other
coding algorithms. The analysis routine produces 256
coefficients for each given audio frame. Assuming each coefficient occupies
16 bits, each frame needs 256 x 16 = 4096 bits of bandwidth. With 32
overlapping samples per frame, the effective frame rate of the system becomes

$$\frac{44100}{4096 - 32} \approx 10.851 \ \text{frames/sec}. \tag{5.1}$$

Therefore, the overall bandwidth of the system is

$$4096 \times 10.851 \approx 44.447 \ \text{Kbps}. \tag{5.2}$$

Since the original audio requires 705.6Kbps of bandwidth, the compression
ratio of the codebook based audio coding algorithm is effectively 15.875.
Considering that the bandwidth required by conventional audio coding
algorithms is around 100Kbps, the bandwidth requirement of the codebook
based method is considerably less demanding.
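
The arithmetic of (5.1) and (5.2) can be reproduced with a few lines of
Python. The constant names are illustrative; the 4096-sample frame length is
read off the denominator of (5.1), and the 705.6Kbps figure is taken to be
44100 samples/sec at 16 bits per sample for a single channel, which is an
interpretation rather than something stated explicitly in the text.

```python
SAMPLE_RATE = 44100       # samples/sec
BITS_PER_SAMPLE = 16      # original audio resolution (assumed single channel)
FRAME_SAMPLES = 4096      # samples per frame, from the denominator of (5.1)
OVERLAP = 32              # samples shared between consecutive frames
COEFFS_PER_FRAME = 256    # coefficients produced by the analysis routine
BITS_PER_COEFF = 16       # conservative assumption from the text

frame_rate = SAMPLE_RATE / (FRAME_SAMPLES - OVERLAP)     # frames/sec, (5.1)
bits_per_frame = COEFFS_PER_FRAME * BITS_PER_COEFF       # 4096 bits
coded_bps = bits_per_frame * frame_rate                  # bits/sec, (5.2)
original_bps = SAMPLE_RATE * BITS_PER_SAMPLE             # 705600 bits/sec
compression_ratio = original_bps / coded_bps

if __name__ == "__main__":
    print(f"frame rate:        {frame_rate:.3f} frames/sec")
    print(f"coded bandwidth:   {coded_bps / 1000:.3f} Kbps")
    print(f"compression ratio: {compression_ratio:.3f}")
```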

Long term monitoring of MPEG audio parameters shows that less
than 20% of the parameters use the full 16 bits of bandwidth. In addition,
many parameters in high frequency subbands are often not required for faithful
reconstruction. Based on such statistics, our assumption of 16 bits for each
parameter is very conservative. Even though future refinements may require
additional bandwidth, an efficient binary representation of the parameters is
expected to reduce the bandwidth requirement substantially.

5.3.5  Scalable  Audio 

In this research all 28 subband signals are represented independently
in terms of the predetermined codewords. Therefore, one can easily control the
overall bandwidth by specifying which subbands to transmit. For example,
when the codebook based audio coding scheme is used to transmit an audio
signal through a network and congestion occurs along the data path,
a control mechanism may reduce the bandwidth by adjusting the scale of the
audio bandwidth.
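
To give a rough sense of this scalability, the sketch below estimates the
bandwidth when only some groups of subbands are transmitted. Selecting whole
groups rather than individual subbands, the function name, and the choice of
dropping Group VI are assumptions made for illustration; the per-group
counts, the 16 bits per coefficient, and the frame rate are those used
elsewhere in this work.

```python
# Coefficients per set and sets per frame for each group (Section 4.7.2).
COEFFS_PER_SET = {"I": 4, "II": 6, "III": 6, "IV": 3, "V": 6, "VI": 3}
SETS_PER_FRAME = {"I": 1, "II": 2, "III": 4, "IV": 8, "V": 16, "VI": 32}

BITS_PER_COEFF = 16
FRAME_RATE = 44100 / (4096 - 32)  # frames/sec, as in (5.1)


def bandwidth_bps(groups):
    """Bandwidth in bits/sec when only the given groups are transmitted."""
    coeffs = sum(COEFFS_PER_SET[g] * SETS_PER_FRAME[g] for g in groups)
    return coeffs * BITS_PER_COEFF * FRAME_RATE


full = bandwidth_bps(COEFFS_PER_SET)                  # all six groups
reduced = bandwidth_bps(["I", "II", "III", "IV", "V"])  # Group VI dropped

if __name__ == "__main__":
    print(f"all groups:       {full / 1000:.3f} Kbps")
    print(f"without Group VI: {reduced / 1000:.3f} Kbps")
```

Because the highest group contributes 32 sets per frame, dropping it removes
a disproportionately large share of the coefficients.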

5.3.6  Implementation  of  Masking  Effects 

This codebook approach allows for easy implementation of the masking
effects in the frequency domain, since the output coefficients are directly
related to the magnitude of the corresponding frequency subbands.
Consequently, a large coefficient may mask off non-trivial coefficients in
neighboring subbands, which are inaudible due to the masking effects.
Furthermore, the time domain masking effects can be implemented easily by
comparing the respective coefficients in the neighboring frames and
subframes. Since the masking effects and related psychoacoustics help filter
off the inaudible coefficients, further efforts toward an efficient
quantization method need to consider only the audible coefficients.

5.4  Conclusions 

The proposed system's performance was encouraging, especially with
stationary input signals. The major frequency components in the original
signals were well preserved. For non-stationary audio inputs, it showed
somewhat poor reconstruction due to the relatively large time window.
However, with future refinements the system is expected to
reproduce fast varying signals faithfully at a substantially lower data rate.

Even for a group of experts, developing an audio coder would normally take
a significant period of time. In this research, we have proposed a novel
approach to achieve a faithful reconstruction of audio signals at a very low
data rate. Accordingly, we have set up a complete infrastructure to implement
the new ideas, and have studied the necessary subcomponents to reach the goal.
Due to limited time and resources, some of the system components are
designed based on conventional mathematical principles rather than
psychoacoustical principles. With future refinements, including several
suggestions in the following chapter, we are confident that this system will
show that the codebook approach can encode and decode audio signals
faithfully at a low data rate, as the CELP algorithm has done in speech
coding.


CHAPTER  VI 
FUTURE  WORK 

Future  refinements  and  developments  are  focused  on  enhancing  the 
performance  of  the  current  implementation  especially  in  two  respects:  audio 
quality  improvement  and  high  compression  without  losing  quality.  The 
following  suggestions  are  oriented  toward  achieving  these  future  goals. 

6.1  System  Design  Based  on  Psychoacoustics
The fundamental problem we have faced throughout this research is to
design a psychoacoustical system using known mathematical principles,
which in general are not based on human physiology. Such principles
therefore do not accommodate the human aspects of perception and often
allow unnecessary redundancies.

In approaching the problem in this work, we have adopted the wavelet
transform instead of the Fourier transform for the subband analysis since the
mathematical properties of the WT are better suited to the psychoacoustical
behavior of human ears. We have also applied psychoacoustical principles at
an earlier stage of the algorithm than any other algorithm available in the
public domain [Sin93, Bra94]. However, the current implementation still has
several components which are not designed based on psychoacoustical
principles. For example, the energy estimator adopts a simple mean-square
operator to measure the amount of energy in each subband (see
Chapter IV). The suggestions for future work in the following sections are
refinements based on psychoacoustics.

We strongly believe that designing the remaining system components
based on psychoacoustical properties will enhance the overall performance of
the system significantly in terms of quality as well as efficiency.

6.2  Optimal  Codebook 
The  codebooks  used  in  the  research  were  carefully  chosen  based  on  a 
psychoacoustical  principle.  After  a  series  of  simulations  with  them,  we 
realized  that  the  codebook  should  be  synchronized  with  the  optimization 
routine,  which  tries  to  match  the  energy  distribution  of  a  given  input  frame 
with  the  codewords  mathematically.  When  the  optimization  routine 
minimizes  the  cost  function,  it  actually  considers  not  only  the  peaks  but  also 
the  valleys  of  Figure  5.1(b).  However,  due  to  masking  effects,  human  ears 
actually  perceive  only  the  governing  peak  within  neighboring  critical  bands. 
Therefore, the optimization routine is wasting its mathematical attention on
inaudible sounds. In addition, human perception of a single frequency is
not as sharp as the curves of Figure 5.1 suggest. This
problem can be reduced by designing the codebook based on psychoacoustics.
Good  candidates  for  the  codebook  may  have  very  similar  spectral  shapes  to 
the human hearing threshold curve for narrow band noises as shown in
Figure 3.11. The current implementation is based on the codewords of
Figure 5.1(c), since they resemble the curves in Figure 3.11 closely within
limited constraints. In order to use the mathematical optimization routine
with the new codebook, the spectrum of a given input frame needs a
pre-processing filter which produces a psychoacoustically equivalent
spectrum contour with respect to the new codebook. The pre-filter will then
help the optimization routine use its mathematical power effectively.

When the codewords span a few neighboring subbands, as suggested in
Figure 3.11, a new optimization method should be developed: when two
neighboring subbands are added together, the resulting spectral energy is not
always the arithmetic sum of the two values. Usually one value masks the
other, so that in the overlapping portion only the larger of the two values
remains. This idea requires further simulations for fine tuning and
refinement.
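
The difference between the two combination rules can be illustrated with a
short sketch. The function names and the sample energy values are
hypothetical, and taking the element-wise maximum is only the simplest
reading of the masking behavior described above; it is not a validated
psychoacoustical model.

```python
from typing import List, Sequence


def combine_by_sum(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Conventional rule: overlapping energies add arithmetically."""
    return [x + y for x, y in zip(a, b)]


def combine_by_masking(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Assumed masking rule: where two codewords overlap, only the larger
    energy in each subband is retained."""
    return [max(x, y) for x, y in zip(a, b)]


# Hypothetical energies of two codewords over four neighboring subbands.
codeword_a = [0, 8, 3, 0]
codeword_b = [0, 2, 9, 1]
print(combine_by_sum(codeword_a, codeword_b))      # -> [0, 10, 12, 1]
print(combine_by_masking(codeword_a, codeword_b))  # -> [0, 8, 9, 1]
```

An optimization routine built on the second rule is no longer linear in the
codeword gains, which is one reason a new optimization method would be
needed.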

When  choosing  the  codebooks,  we  used  an  arbitrary  phase  for  each 
sinusoidal  component.  This  approach  was  based  on  the  findings  that  the 
phase  variation  has  very  little  effect  on  human  auditory  perception  [Zwi90, 
Zwi91].  The  effects  of  phase  of  each  sinusoid  in  the  codebook  need  to  be 
examined  more  carefully  in  future  implementations. 

6.3  Time  Window 
Unlike other audio coding methods [Sin93, Bra94], our approach
represents the subband signals with predetermined codewords. When the
codewords are short enough in time, replacing the subband signals
with them has no significant impact on human perception. However,
when the codewords are long enough that human ears notice time-varying
characteristics within the codewords, the overall performance of the system
degrades significantly.

The  size  of  an  input  audio  frame  in  the  current  implementation  is 
determined  to  prevent  excessive  numerical  errors,  according  to  the  simulation 
results  and  the  analysis  trees.  A  study  is  necessary  to  determine  the 
appropriate  size  of  an  input  frame  which  maintains  the  numerical  integrity 
and  satisfies  the  time-varying  characteristics  of  input  signals. 

6.4  Psychoacoustical  Dynamic  Quantizer
The optimization routine effectively computes the parameters to be
sent to the other end and/or stored for later synthesis. Depending on
their dynamic ranges, the parameters can be represented with a varying
number of bits. The quantization noise in the parameters eventually causes
some error in the output signal. As long as the error in the output is
inaudible, the quantization noise is transparent to human perception. The
amount of quantization error that can be tolerated varies considerably with
the psychoacoustical properties at a given time. Therefore, the number of
bits for each parameter should be a function of the permissible quantization
error, which will be determined by psychoacoustical characteristics. When
the frequency characteristics of an input signal are time-varying, the number
of bits allocated to each parameter should also be time-varying. The
psychoacoustical quantizer helps reduce waste by allocating the minimum
necessary number of bits to each parameter. It also provides the capability
of controlling the output data rate dynamically, according to the
availability of bandwidth and/or the application requirements.
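
The dependence of the bit count on the permissible quantization error can be
sketched under a standard textbook model. The sketch assumes a uniform
quantizer whose error is uniformly distributed, so that the noise power is
the squared step size divided by 12. The function name and the numeric
values are hypothetical, and the permissible noise power is taken as a given
input rather than derived from a psychoacoustical model.

```python
import math


def bits_for_parameter(dynamic_range: float, allowed_noise_power: float) -> int:
    """Smallest bit count whose uniform quantizer stays within the noise limit.

    Model (an assumption of this sketch): noise power = step**2 / 12,
    with step = dynamic_range / 2**bits.
    """
    if dynamic_range <= 0 or allowed_noise_power <= 0:
        raise ValueError("dynamic_range and allowed_noise_power must be positive")
    max_step = math.sqrt(12.0 * allowed_noise_power)  # largest tolerable step
    levels = dynamic_range / max_step                 # minimum number of levels
    return max(0, math.ceil(math.log2(levels)))


# Hypothetical parameter with a dynamic range of 2.0.
print(bits_for_parameter(2.0, 1e-4))  # strict limit  -> 6
print(bits_for_parameter(2.0, 1e-2))  # relaxed limit -> 3
```

Re-evaluating the permissible noise power for every frame makes the bit
allocation time-varying, and raising it uniformly gives a simple way to lower
the output data rate when bandwidth is scarce.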

6.5  Masking  Effects 
Reducing the redundancy in the encoded bit stream increases the compression
significantly. The current simulations adopt some ideas from
psychoacoustical theory, but there are many other aspects which should be
exploited further. The masking effects can provide an immediate
improvement  in  terms  of  bandwidth  savings.  They  describe  the  conditions 
under  which  certain  frequency  components  become  inaudible  to  human  ears. 
Therefore,  the  bandwidth  used  to  encode  those  inaudible  components  can  be 
saved.  The  masking  effects  are  not  fully  implemented  in  the  current  version. 

6.6  Research  on  Base  Wavelets 
This research was initiated as an engineering investigation rather than
a pure mathematical exploration. Hence, the base wavelet was adopted from
recent research in the mathematics community [Dau88] after a series of
simulations with different base wavelets. As described in the earlier
sections, there are various base wavelets with different properties, which
may be advantageous to some applications and, quite possibly,
disadvantageous to others. Since the current model assumes a sharp frequency
segmentation, one of the most desirable characteristics of the base wavelet
is a sharp cutoff frequency (see Chapter III for details). The frequency
segmentation can be further enhanced by the development of new base wavelets
with even sharper frequency discrimination. Adopting such a base wavelet
could provide the most significant improvement.


APPENDIX 

ENERGY  DISTRIBUTION  OF  NARROW  BAND  SIGNALS 


There are 28 narrow band signals, each corresponding to a critical band
rate. They are transformed to wavelet coefficients, and their
energy distributions along the subbands are computed respectively. Since the
frequency characteristics of the base wavelet are not ideal, the energy of a
narrow band signal has non-trivial residues over several neighboring
subbands. The following figures of the energy distributions of the 28 narrow
band signals show such effects. Most importantly, they also show a clear
tendency for the major energy peaks to move toward higher critical band
rates as the frequency of the input narrow band signal goes higher. A further
study also proves that the 28 curves of energy signatures constitute a
mutually independent basis. The energy curves are grouped according to the
main text, and shown in the following order: Group I-1, I-2, I-3, I-4 (narrow
band signals 1 - 4), Group II-1, II-2, II-3, II-4 (narrow band signals 5 - 10),
Group III-1, III-2, III-3, III-4, III-5, III-6 (narrow band signals 11 - 16),
Group IV-1, IV-2, IV-3 (narrow band signals 17 - 19), Group V-1, V-2, V-3,
V-4, V-5, V-6 (narrow band signals 20 - 25), and Group VI-1, VI-2, VI-3
(narrow band signals 26 - 28).
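
The independence of a set of energy signatures can be checked numerically by
stacking the curves as rows of a matrix and computing its rank. The sketch
below is a generic linear independence test and is not the procedure used in
this research. The measured 28 curves are not reproduced here, so small
hypothetical matrices stand in for them, and reading "mutually independent"
as linear independence is an assumption.

```python
from typing import List, Sequence


def matrix_rank(rows: Sequence[Sequence[float]], tol: float = 1e-9) -> int:
    """Rank of a matrix by Gaussian elimination with partial pivoting."""
    m: List[List[float]] = [[float(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == len(m):
            break  # every row already holds a pivot
        # Row at or below `rank` with the largest entry in this column.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Hypothetical signatures: each row is one narrow band signal, each column
# one subband; energy leaks into neighboring subbands as described above.
signatures = [
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
]
print(matrix_rank(signatures) == len(signatures))  # -> True (independent)
```

For the 28 measured curves, the matrix would be 28 by 28, and a rank of 28
would confirm that no signature is a linear combination of the others.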